(57) ABSTRACT

A processing system is disclosed. The processing system
includes at least one cache and at least one scratch pad
memory. The system also includes a processor for accessing
the at least one cache and at least one scratch pad memory.
The at least one scratch pad memory is smaller in size than
the at least one cache. The processor accesses the data in the
at least one scratch pad memory before accessing the at least
one cache to determine if the appropriate data is therein.
There are two important features of the present invention.
The first feature is that an instruction can be utilized to fill
a scratch pad memory with the appropriate data in an
efficient manner. The second feature is that once the scratch
pad has the appropriate data, it can be accessed more
efficiently to retrieve this data within the cache and memory
space not needed for this data. This has a particular
advantage for frequently used routines, such as a mathematical
algorithm to minimize the amount of space utilized in the
cache for such routines. Accordingly, the complexity of the
cache is not required using the scratch pad memory as well
as space within the cache is not utilized.

3 Claims, 3 Drawing Sheets 
SCRATCH PAD MEMORIES



DETAILED DESCRIPTION 



FIELD OF THE INVENTION 

The present invention relates generally to a processing
system and more particularly to a processing system that
includes a scratch pad for improved performance.

BACKGROUND OF THE INVENTION 

Processor architectures are utilized for a variety of func- 
tions. FIG. 1 is a simple block diagram of a conventional 
processing system 10. The processing system 10 includes a 
core processor 12 which controls a system bus interface unit 
18. The core processor 12 also interacts with an instruction 
cache and a data cache. Typically, the core processor
retrieves information from the data cache or the instruction
cache for operation rather than obtaining data from system
memory as is well known. Since the data cache and instruc- 
tion cache are smaller in size, data can be accessed from 
them more readily if it is resident therein. 

In this type of processing system, oftentimes small rou- 
tines are provided which can further affect the performance 
of the system. Accordingly, the caches are placed therein to
allow faster access rather than having to access system
memory. Although these caches are faster than system 
memory, they still are relatively slow if the routine needs to 
be accessed on a continual basis therefrom. For example, 
small routines may take up several cycles which can become 
a performance bottleneck in a processing system. So what is 
desired is a system which will allow one to more quickly 
access and obtain certain routines and therefore improve the 
overall performance of the system in the data cache without
wasting memory space. 

The system must be easy to implement utilizing existing 
technologies. The present invention addresses such a need. 

SUMMARY OF THE INVENTION 

A processing system is disclosed. The processing system 
includes at least one cache and at least one scratch pad 
memory. The system also includes a processor for accessing 
the at least one cache and at least one scratch pad memory. 
The at least one scratch pad memory is smaller in size than 
the at least one cache. The processor accesses the data in the
at least one scratch pad memory before accessing the at least 
one cache, to determine if the appropriate data is therein. 

There are two important features of the present invention. 
The first feature is that an instruction can be utilized to fill 
a scratch pad memory with the appropriate data in an 
efficient manner. The second feature is that once the scratch 
pad has the appropriate data, it can be accessed more 
efficiently to retrieve this data within the cache and memory 
space not needed for this data. This has a particular advan- 
tage for frequently used routines, such as a mathematical 
algorithm to minimize the amount of space utilized in the 
cache for such routines. Accordingly, the complexity of the 
cache is not required using the scratch pad memory as well 
as space within the cache is not utilized. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a simple block diagram of a conventional
processing system.

FIG. 2 is a simple block diagram of a system in accor- 
dance with the present invention. 

FIG. 3 is a diagram of a register utilized for a scratch pad 
in accordance with the present invention. 
The present invention relates generally to a processing
system and more particularly to a processing system that
includes a scratch pad for improved performance. The
following description is presented to enable one of ordinary
skill in the art to make and use the invention and is provided
in the context of a patent application and its requirements.
Various modifications to the preferred embodiment and the 
generic principles and features described herein will be 
readily apparent to those skilled in the art. Thus, the present 
invention is not intended to be limited to the embodiment 
shown but is to be accorded the widest scope consistent with 
the principles and features described herein. 

FIG. 2 is a block diagram of a system 100 in accordance 
with the present invention. Those elements that are similar 
to those of FIG. 1 are given similar reference numbers. As 
is seen, a scratch pad memory 102 is provided for the 
instruction cache and a scratch pad memory 104 is provided 
with the data cache 16'. The scratch pad memories 102 and
104 are typically 2 Kb in size as compared to an 8 Kb data
cache and 8 Kb instruction cache. In a preferred
embodiment, the scratch pad memories 102 and 104 have
the highest priority when accessing data. A state machine
112 is coupled to the instruction cache 14' and data cache 16'
and interacts with scratch pad memories 102, 104 and a
register 114 within the core processor 12'. The state machine
112 provides access to the register 114 within the core
processor 12'.

FIG. 3 is a diagram of the register 114 utilized for a 
scratch pad memory in accordance with the present inven- 
tion. The register 114 includes an enable (E) bit 202 for 
enabling the scratch pad memory, a fill (F) bit 204 to fill the 
scratch pad memory and bits 206 for storing the base address
for the instruction that causes the filling of the scratch pad
memories 102 and 104.
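
The register fields described above can be pictured as a packed bit field. The following is only a sketch: the text names an enable bit, a fill bit and base address bits but gives no bit positions or widths, so the positions used here (`ENABLE_BIT`, `FILL_BIT`, `BASE_SHIFT`) and the helper names are hypothetical.

```python
# Hypothetical layout: the description names the fields of register 114
# but does not give their bit positions, so these values are assumptions.
ENABLE_BIT = 1 << 0   # E bit 202: scratch pad memory enabled
FILL_BIT = 1 << 1     # F bit 204: a fill of the scratch pad is requested
BASE_SHIFT = 2        # base address bits 206 assumed to sit above E and F


def pack_register(enable, fill, base_address):
    """Pack the three fields into a single integer register value."""
    value = base_address << BASE_SHIFT
    if enable:
        value |= ENABLE_BIT
    if fill:
        value |= FILL_BIT
    return value


def unpack_register(value):
    """Return (enable, fill, base_address) from a packed register value."""
    enable = bool(value & ENABLE_BIT)
    fill = bool(value & FILL_BIT)
    return enable, fill, value >> BASE_SHIFT
```

Writing `pack_register(True, True, base)` corresponds to the case described in the text where both the enable bit and the fill bit are set to 1 so that data can be loaded.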

There are two important features of the present invention.
The first feature is that an instruction can be utilized to fill
a scratch pad memory with the appropriate data in an
efficient manner. The second feature is that once the scratch
pad has the appropriate data, it can be accessed more
efficiently to retrieve this data within the cache and memory
space not needed for this data. This has a particular
advantage for frequently used routines, such as a mathematical
algorithm to minimize the amount of space utilized in the
cache for such routines. Accordingly, the complexity of the
cache is not required using the scratch pad memory as well
as space within the cache is not utilized.

The operation of the present invention will be described
in the context of the instruction cache 14' and its associated
scratch pad memory 102 but one of ordinary skill in the art
recognizes that the data cache 16' and its associated scratch
pad memory 104 could be utilized in a similar manner.

A system in accordance with the present invention
operates in the following manner. First the filling of the scratch
pads will be described. Assuming there is a cache miss, then
the data from system memory will be read, and the scratch
pad memory 102 will be filled. The scratch pad 102 will be
filled based upon an instruction resident in the register 114.
In a preferred embodiment the enable bit is set to 1 and the
fill bit is set to 1 to indicate that data can be loaded into the
scratch pad memory. The core processor 12' reads the data
from the base address range of the register 114 and this will
be the data that will be provided to the scratch pad memory
102. The state machine 112 captures the event of writing into
the register 114 and causes the system bus unit 18' to fill the
scratch pad memory 102. When the scratch pad memory 102
is filled, the pipeline is released by the processor 12'.
Therefore, the scratch pad memory 102 then includes the
routine (for example, a mathematical algorithm). Once
released, then processing can continue.

Next, the accessing of the scratch pad memory 102 will be
described. Accordingly, when the particular routine needs to
be accessed, first the processor 12' accesses the scratch pad
memory 102 to determine whether the data is there. If the
data is there, it can be read directly from the scratch pad in
a more efficient manner than reading it from the data cache.
This can be performed several times to give the processor
faster access to the data. If the data is not there
then the processor accesses the data in the cache. If the data
is not within the scratch pad memory or the data cache then
the processor will obtain the data from system memory 20'.
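
The access order described above (scratch pad first, then cache, then system memory) can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the hardware itself: the class `MemoryHierarchy`, its dictionary-backed storage, and the choice to allocate a cache entry on a miss are all hypothetical and are not taken from the description.

```python
class MemoryHierarchy:
    """Toy model of the lookup order: scratch pad, then cache, then memory."""

    def __init__(self, system_memory):
        self.system_memory = dict(system_memory)
        self.scratch_pad = {}
        self.cache = {}

    def fill_scratch_pad(self, base, length):
        """Copy `length` words starting at `base` from system memory."""
        for address in range(base, base + length):
            self.scratch_pad[address] = self.system_memory[address]

    def read(self, address):
        """Return (value, source), checking the scratch pad before the cache."""
        if address in self.scratch_pad:
            return self.scratch_pad[address], "scratch pad"
        if address in self.cache:
            return self.cache[address], "cache"
        value = self.system_memory[address]
        # Assumption for illustration: a miss allocates the word in the cache.
        self.cache[address] = value
        return value, "system memory"
```

In this model a routine loaded with `fill_scratch_pad` is always served from the scratch pad and never occupies cache entries, which is the space saving the description attributes to the scratch pad memory.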

Accordingly, through a system and method in accordance 
with the present invention a processing system's perfor- 
mance is significantly improved since data can be accessed 
more quickly from the scratch pad memory. In addition, the 
filling of the scratch pad memory can be accomplished in a
simple and straightforward manner. 

Although the present invention has been described in
accordance with the embodiments shown, one of ordinary
skill in the art will readily recognize that there could be
variations to the embodiments and those variations would be
within the spirit and scope of the present invention.
Accordingly, many modifications may be made by one of
ordinary skill in the art without departing from the spirit and
scope of the appended claims.

What is claimed is: 

1. A system for improving the performance of a process- 
ing system, the processing system including a processor and 
at least one cache, the system comprising: 

a scratch pad memory which can be accessed by the 
processor; 

a mechanism for providing the scratch pad memory with 
the appropriate data when the data is not within the at 
least one cache, wherein the scratch pad is smaller than 
the at least one cache and is accessed by the processor 
before the at least one cache; 

a register within the processor; 

a state machine for accessing the register when the scratch
pad memory is to be filled;

an instruction within the register for initiating the filling
of the scratch pad memory; and

a system interface unit for filling the scratch pad memory
with the appropriate data.

2. The system of claim 1 wherein the filling of the scratch
pad is provided by a system memory. 

3. The system of claim 2 wherein the register comprises 
an enable bit, a fill bit and a base address for the instruction. 
[57] ABSTRACT

A CPU or microprocessor which includes a general purpose
CPU component, such as an X86 core, and also includes a
DSP core. In a first embodiment, the CPU receives general 
purpose instructions, such as X86 instructions, wherein 
certain X86 instruction sequences implement DSP func- 
tions. The CPU includes a processor mode register which is 
written with one or more processor mode bits to indicate 
whether an instruction sequence implements a DSP function. 
The CPU also includes an intelligent DSP function decoder 
or preprocessor which examines the processor mode bits and 
determines if a DSP function is being executed. If a DSP 
function is being implemented by an instruction sequence, 
the DSP function decoder converts or maps the opcodes to 
a DSP macro instruction that is provided to the DSP core. 
The DSP core executes one or more DSP instructions to 
implement the desired DSP function in response to the
macro instruction. If the processor mode bits indicate that 
X86 instructions in the instruction memory do not imple- 
ment a DSP-type function, the opcodes are provided to the 
X86 core, as occurs in current prior art computer
systems. In a second embodiment, the CPU receives 
sequences of instructions comprising X86 instructions and 
DSP instructions. The processor mode register is written 
with one or more processor mode bits to indicate whether an 
instruction sequence comprises X86 or DSP instructions, 
and the instructions are routed to the X86 core or to the DSP 
core accordingly. 

20 Claims, 11 Drawing Sheets 
CENTRAL PROCESSING UNIT INCLUDING
APX AND DSP CORES AND INCLUDING
SELECTABLE APX AND DSP EXECUTION
MODES

CONTINUATION DATA 

This is a continuation-in-part of application Ser. No.
08/618,243 titled "Central Processing Unit Having an X86
and DSP Core and Including a DSP Function Decoder which
Maps X86 Instructions to DSP Instructions" and filed Mar.
18, 1996, and which is assigned to Advanced Micro Devices
Corp., now U.S. Pat. No. 5,794,068.

CROSS REFERENCE TO RELATED 
APPLICATIONS 

The following applications are related to the present
application and are hereby incorporated by reference in their
entirety.

U.S. patent application Ser. No. 08/618,243, titled "Central
Processing Unit Having an X86 and DSP Core and
Including a DSP Function Decoder which Maps X86
Instructions to DSP Instructions" and filed Mar. 18, 1996,
now U.S. Pat. No. 5,794,068.

U.S. patent application Ser. No. 08/618,000, titled "Central
Processing Unit Having X86 and DSP Functional Units"
and filed Mar. 18, 1996, now U.S. Pat. No. 5,781,792.

U.S. patent application Ser. No. 08/618,242, titled "Central
Processing Unit Including a DSP Function Preprocessor
Having a Pattern Recognition Detector for Detecting
Instruction Sequences which Perform DSP Functions" and
filed Mar. 18, 1996, now U.S. Pat. No. 5,754,878.

U.S. patent application Ser. No. 08/618,241, titled "Central
Processing Unit Including a DSP Function Preprocessor
Having a Look-up Table Apparatus for Detecting Instruction
Sequences which Perform DSP Functions" and filed Mar.
18, 1996, now U.S. Pat. No. 5,784,640.

U.S. patent application Ser. No. 08/618,240, titled "Central
Processing Unit Including a DSP Function Preprocessor
Which Scans Instruction Sequences for DSP Functions" and
filed Mar. 18, 1996, now U.S. Pat. No. 5,790,824.

The above related applications are all assigned to 
Advanced Micro Devices, Inc. 

FIELD OF THE INVENTION 

The present invention relates to a computer system CPU
or microprocessor which includes a general purpose core 
and a DSP core, wherein the CPU includes a switch for 
selecting a processor execution mode to selectively enable 
processing of DSP instructions. 

DESCRIPTION OF THE RELATED ART 

Personal computer systems and general purpose micro- 
processors were originally developed for business applica- 
tions such as word processing and spreadsheets, among 
others. However, computer systems are currently being used
to handle a number of real time DSP-related applications,
including multimedia applications having video and audio
components, video capture and playback, telephony
applications, speech recognition and synthesis, and
communication applications, among others. These real time or
DSP-like applications typically require increased CPU
floating point performance.

One problem that has arisen is that general purpose
microprocessors originally designed for business applications
are not well suited for the real-time requirements and
mathematical computation requirements of modern
DSP-related applications, such as multimedia applications and
communications applications. For example, the X86 family
of microprocessors from Intel Corporation are oriented
toward integer-based calculations and memory management
operations and do not perform DSP-type functions very
well.

As personal computer systems have evolved toward more
real-time and multimedia capable systems, the general
purpose CPU has been correspondingly required to perform
more mathematically intensive DSP-type functions.
Therefore, many computer systems now include one or more
digital signal processors which are dedicated towards these
complex mathematical functions.

A recent trend in computer system architectures is the
movement toward "native signal processing (NSP)". Native
signal processing or NSP was originally introduced by Intel
Corporation as a strategy to offload certain functions from
DSPs and perform these functions within the main or
general purpose CPU. The strategy presumes that, as
performance and clock speeds of general purpose CPUs
increase, the general purpose CPU is able to perform many
of the functions formerly performed by dedicated DSPs.
Thus, one trend in the microprocessor industry is an effort to
provide CPU designs with higher speeds and augmented
with DSP-type capabilities, such as more powerful floating
point units. Another trend in the industry is for DSP
manufacturers to provide DSPs that not only run at high speeds
but also can emulate CPU-type capabilities such as memory
management functions.

A digital signal processor is essentially a general purpose 
microprocessor which includes special hardware for execut- 
ing mathematical functions at speeds and efficiencies not 
usually associated with microprocessors. In current com- 
puter system architectures, DSPs are used as co-processors 
and operate in conjunction with general purpose CPUs 
within the system. For example, current computer systems 
may include a general purpose CPU as the main CPU and 
include one or more multimedia or communication expan- 
sion cards which include dedicated DSPs. The CPU offloads 
mathematical functions to the digital signal processor, thus 
increasing system efficiency. 

Digital signal processors include execution units that
comprise one or more arithmetic logic units (ALUs) coupled
to hardware multipliers which implement complex
mathematical algorithms in a pipelined manner. The instruction
set primarily comprises DSP-type instructions and also
includes a small number of instructions having non-DSP
functionality.

The DSP is typically optimized for mathematical
algorithms such as correlation, convolution, finite impulse
response (FIR) filters, infinite impulse response (IIR) filters,
Fast Fourier Transforms (FFTs), matrix computations, and
inner products, among other operations. Implementations of
these mathematical algorithms generally comprise long
sequences of systematic arithmetic/multiplicative
operations. These operations are interrupted on various occasions
by decision-type commands. In general, the DSP sequences
are a repetition of a very small set of instructions that are
executed 70% to 90% of the time. The remaining 10% to
30% of the instructions are primarily Boolean/decision
operations (or general data processing).
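
As an illustration of the pattern described above, a direct-form FIR filter can be written so that the repeated multiply-accumulate step and the occasional decision step are both visible. This is a generic textbook formulation offered as a sketch; the function name `fir_filter` and the treatment of samples before the start of the input as zero are assumptions, not details from the description.

```python
def fir_filter(samples, coefficients):
    """Direct-form FIR: y[n] = sum over k of coefficients[k] * samples[n - k].

    Samples before the start of the input are treated as zero (assumption).
    """
    output = []
    for n in range(len(samples)):
        accumulator = 0
        for k, coefficient in enumerate(coefficients):
            # Decision-type operation: skip taps that reach before the input.
            if n - k >= 0:
                # Multiply-accumulate: the small instruction set that repeats.
                accumulator += coefficient * samples[n - k]
        output.append(accumulator)
    return output
```

Nearly all of the work is the single multiply-accumulate line executed once per tap per sample, which is the kind of repetition that DSP hardware multipliers are built to pipeline.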

A general purpose CPU is comprised of an execution unit,
a memory management unit, and a floating point unit, as
well as other logic. The task of a general purpose CPU is to
execute code and perform operations on data in the
computer memory and thus to manage the computing platform.
In general, the general purpose CPU architecture is designed
primarily to perform Boolean/management/data
manipulation decision operations. The instructions or opcodes
executed by a general-purpose CPU include basic
mathematical functions. However these mathematical functions
are not well adapted to complex DSP-type mathematical
operations. Thus a general purpose CPU is required to
execute a large number of opcodes or instructions to perform
basic DSP functions.

Therefore, a computer system and CPU architecture is 
desired which includes a general purpose CPU and which 
also performs DSP-type mathematical functions with 
increased performance. A CPU architecture is also desired 
which is backwards compatible with existing software appli- 
cations which presume that the general purpose CPU is 
performing all of the mathematical computations. A new 
CPU architecture is further desired which provides increased 
mathematical performance for existing software applica- 
tions. 

One popular microprocessor used in personal computer
systems is the X86 family of microprocessors. The X86
family of microprocessors includes the 8088, 8086, 80186,
80286, 80386, i486, Pentium, and P6 microprocessors from
Intel Corporation. The X86 family of microprocessors also
includes X86 compatible processors such as the 4486 and
K5 processors from Advanced Micro Devices, the M1
processor from Cyrix Corporation, and the NextGen 5x86
and 6x86 processors from NextGen Corporation. The X86
family of microprocessors was primarily designed and
developed for business applications. In general, the
instruction set of the X86 family of microprocessors does not
include sufficient mathematical or DSP functionality for
modern multimedia and communications applications.
Therefore, a new X86 CPU architecture is further desired
which implements DSP functions more efficiently than
current X86 processors. It would further be desirable that this
new X86 CPU architecture did not require additional
opcodes for the X86 processor.

SUMMARY OF THE INVENTION 

The present invention comprises a CPU or microproces- 
sor which includes a general purpose CPU component, such 
as an X86 core, and also includes a DSP core. The CPU 
includes a switch for selecting a processor execution mode. 
The switch selectively enables processing of general
purpose instructions, e.g., APX instructions, or DSP
instructions. In the preferred embodiment comprising an
APX-based CPU, the CPU includes one or more bits, referred to
as processor mode bits, that are set to indicate whether the
instruction decode engine should interpret the incoming
code sequence as DSP instructions or APX instructions.
Thus, for example, the processor mode bit is set to indicate
a sequence of DSP instructions, and the processor mode bit
is cleared to indicate that the program sequence reverts back
to a normal APX mode of operation. The CPU may include
other means for indicating or differentiating between APX
and DSP instructions, as desired. The CPU includes a
preprocessor which examines the processor mode bit and
selectively provides instructions to either the X86 core or the
DSP.

In a first embodiment, the CPU receives only APX
instructions. In this first embodiment, the CPU includes an
intelligent DSP function decoder or preprocessor which
examines sequences of APX instructions or opcodes (X86
opcodes) and converts or maps the instruction sequence to a
DSP macro instruction or function identifier that is provided
to the DSP core. The processor mode bit is set to indicate the
start of an APX code sequence which implements a DSP
function. The preprocessor thus examines the processor
mode bit to determine if a DSP function is being executed.
If the preprocessor determines that a DSP function is being
executed based on the processor mode bit, the preprocessor
converts or maps the instruction sequence to a DSP macro
instruction or function identifier that is provided to the DSP
core. The DSP core executes one or more DSP instructions
to implement the desired DSP function indicated by the DSP
macro or function identifier. The DSP core preferably
performs the DSP function in parallel with other operations
performed by the general purpose CPU core for increased
system performance.

In one embodiment, the CPU includes a processor mode 
register which stores the processor mode bit, and also 
includes one or more bits, preferably a plurality of bits, 
which identify the type of DSP function implemented by the 
instruction sequence. Thus, the preprocessor examines the 
processor mode bit to determine if the APX code sequence 
implements a DSP function. If so, the preprocessor examines 
the plurality of bits to determine the general type of DSP 
function being implemented. The preprocessor uses the 
information on the general type of DSP function in creating
the function identifier, and the preprocessor also examines 
the instruction sequence to extract values and parameters 
necessary for the DSP core to implement the DSP function. 

In a second embodiment, the CPU receives an instruction
sequence which comprises sequences of general purpose,
e.g., APX instructions, and which also comprises sequences
of DSP instructions. The respective processor mode bit is set
to indicate the beginning of a sequence of DSP instructions,
and the processor mode bit is cleared to indicate the
beginning of a sequence of APX instructions. The CPU thus routes
the instructions to the APX core or the DSP core based on
the status of the processor mode bit.
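
The routing behavior of the second embodiment can be modeled in a few lines. This is a sketch under assumptions: the stream representation, with `("set_mode", bit)` entries standing in for writes to the processor mode bit and `("op", name)` entries standing in for instructions, is hypothetical, as are the core labels.

```python
def route_instructions(stream):
    """Assign each instruction to a core according to the current mode bit.

    `stream` items are ("set_mode", bit) or ("op", name); both forms are
    illustrative stand-ins, not an encoding taken from the description.
    """
    dsp_mode = False  # assumed to start in the general purpose mode
    routed = []
    for kind, payload in stream:
        if kind == "set_mode":
            # Setting the bit starts a DSP sequence; clearing it ends one.
            dsp_mode = bool(payload)
        else:
            routed.append(("DSP" if dsp_mode else "X86", payload))
    return routed
```

For example, a stream that sets the mode bit, issues two instructions, and clears the bit again routes only those two instructions to the DSP core.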

BRIEF DESCRIPTION OF THE DRAWINGS 

A better understanding of the present invention can be 
obtained when the following detailed description of the 
preferred embodiment is considered in conjunction with the 
following drawings, in which: 

FIG. 1 is a block diagram of a computer system including 
a CPU having a general purpose CPU core and a DSP core 
according to the present invention;

FIG. 2 is a block diagram of the CPU of FIG. 1 including 
a general purpose CPU core and a DSP core and including 
a DSP function preprocessor according to the present inven- 
tion; 

FIG. 3 is a flowchart diagram illustrating operation of the 
present invention; 

FIG. 4 is a more detailed block diagram of the CPU of 
FIG. 1;

FIG. 5 is a block diagram of the Instruction Decode Unit 
of FIG. 4; 

FIG. 6 is a block diagram of the function preprocessor 
including a pattern recognition detector according to one 
embodiment of the invention; 

FIG. 7 illustrates operation of the pattern recognition 
detector of FIG. 6; 

FIG. 8 is a block diagram of the function preprocessor 
including a look-up table according to one embodiment of 
the invention; 
FIG. 9 illustrates operation of the look-up table of FIG. 8;

FIG. 10 is a block diagram of the CPU according
to the second embodiment;

FIG. 11 is a flowchart diagram illustrating a second
embodiment of the present invention;

FIG. 12 illustrates one embodiment of the processor mode
register; and

FIG. 13 illustrates one embodiment of an instruction
sequence which includes a DSP instruction sequence.

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 
Incorporation by Reference 

Pentium System Architecture by Don Anderson and Tom
Shanley and available from Mindshare Press, 2202
Buttercup Dr., Richardson, Tex. 75082 (214) 231-2216, is hereby
incorporated by reference in its entirety.

Digital Signal Processing Applications Using the
ADSP-2100 Family Volumes 1 and 2, 1995 edition, available from
Analog Devices Corporation of Norwood Mass., is hereby
incorporated by reference in its entirety.

The Intel CPU Handbook, 1994 and 1995 editions,
available from Intel Corporation, are hereby incorporated by
reference in their entirety.

The AMD K5 Handbook, 1995 edition, available from
Advanced Micro Devices Corporation, is hereby
incorporated by reference in its entirety.
Computer System Block Diagram 

Referring now to FIG. 1, a block diagram of a computer
system incorporating a central processing unit (CPU) or
microprocessor 102 according to the present invention is
shown. The computer system shown in FIG. 1 is illustrative
only, and the CPU 102 of the present invention may be
incorporated into any of various types of computer systems.

As shown, the CPU 102 includes a general purpose CPU
core 212 and a DSP core 214. The general purpose core 212
executes general purpose (non-DSP) opcodes and the DSP
core 214 executes DSP-type functions, as described further
below. In the preferred embodiment, the general purpose
CPU core 212 is an X86 core, i.e., is compatible with the
X86 family of microprocessors. However, the general
purpose CPU core 212 may be any of various types of CPUs,
including the PowerPC family, the DEC Alpha, and the
SunSparc family of processors, among others. In the
following disclosure, the general purpose CPU core 212 is
referred to as an X86 core for convenience. The general
purpose core 212 may comprise one or more general
purpose execution units, and the DSP core 214 may comprise
one or more digital signal processing execution units.

As discussed further below, the CPU includes a switch
213 for selecting a processor execution mode. The switch
213 selectively enables processing of general purpose
instructions, e.g., APX instructions, or DSP instructions. In
the preferred embodiment comprising an APX-based CPU,
the CPU includes one or more bits in a register, referred to
as processor mode bits, that are set to indicate whether the
instruction decode engine should interpret the incoming
code sequence as DSP instructions or APX instructions.
Thus, for example, the processor mode bit is set to indicate
a sequence of DSP instructions, and the processor mode bit
is cleared to indicate that the program sequence reverts back
to a normal APX mode of operation. The CPU 102 may
include other means for indicating or differentiating between
APX and DSP instructions, as desired.

The CPU 102 also includes a preprocessor 204 which
examines the processor mode bit and selectively provides
instructions to either the X86 core 212 or the DSP 214.



As shown, the CPU 102 is coupled through a CPU local 
bus 104 to a host/PCI /cache bridge or chipset 106. The 
chipset 106 is preferably similar to the Triton chipset avail- 
able from Intel Corporation. A second level or L-2 cache 
memory (not shown) may be coupled to a cache controller 
in the chipset, as desired. Also, for some processors the 
external cache may be an L1 or first level cache. The bridge
or chipset 106 couples through a memory bus 108 to main 
memory 110. The main memory 110 is preferably DRAM 
(dynamic random access memory) or EDO (extended data 
out) memory, or other types of memory, as desired. 

The chipset 106 includes various peripherals, including an 
interrupt system, a real time clock (RTC) and timers, a direct
memory access (DMA) system, ROM/Flash memory, com- 
munications ports, diagnostics ports, command/status 
registers, and non-volatile static random access memory 
(NVSRAM) (all not shown).

The host/PCI/cache bridge or chipset 106 interfaces to a 
peripheral component interconnect (PCI) bus 120. In the 
preferred embodiment, a PCI local bus is used. However, it
is noted that other local buses may be used, such as the 
VESA (Video Electronics Standards Association) VL bus. 
Various types of devices may be connected to the PCI bus 
120. In the embodiment shown in FIG. 1, a video/graphics
controller or adapter 170 and a network interface controller 
140 are coupled to the PCI bus 120. The video adapter 
connects to a video monitor 172, and the network interface 
controller 140 couples to a local area network (LAN). A 
SCSI (small computer systems interface) adapter 122 may 
also be coupled to the PCI bus 120, as shown. The SCSI
adapter 122 may couple to various SCSI devices 124, such 
as a CD-ROM drive and a tape drive, as desired. Various 
other devices may be connected to the PCI bus 120, as is 
well known in the art. 

Expansion bus bridge logic 150 may also be coupled to 
the PCI bus 120. The expansion bus bridge logic 150
interfaces to an expansion bus 152. The expansion bus 152 
may be any of varying types, including the industry standard 
architecture (ISA) bus, also referred to as the AT bus, the 
extended industry standard architecture (EISA) bus, or the
MicroChannel architecture (MCA) bus. Various devices 
may be coupled to the expansion bus 152, such as expansion 
bus memory 154 and a modem 156. 
CPU Block Diagram 

Referring now to FIG. 2, a high level block diagram 
illustrating certain components in the CPU 102 of FIG. 1 is 
shown. As shown, the CPU 102 includes an instruction 
cache or instruction memory 202 which receives instructions 
or opcodes from the system memory 110. Function prepro- 
cessor 204 is coupled to the instruction memory 202 and 
examines instruction sequences or opcode sequences in the 
instruction memory 202. The function preprocessor 204 is 
also coupled to the X86 core 212 and the DSP core 214. The 
function preprocessor 204 is further coupled to the processor 
mode register 213 storing the processor mode bit. As shown,
the function preprocessor 204 examines the processor mode 
bit and selectively provides instructions or opcodes to either 
the X86 core 212 or selectively provides op-codes or infor- 
mation to the DSP core 214. 

The X86 core 212 and DSP core 214 are coupled together
and provide data and timing signals between each other. In 
one embodiment, the CPU 102 includes one or more buffers 
(not shown) which interface between the X86 core 212 and 
the DSP core 214 to facilitate transmission of data between 
the X86 core 212 and the DSP core 214. 

In a first embodiment, the CPU 102 receives only APX 
instructions. In this first embodiment, if the processor mode
bit is set to indicate DSP functions, the function preprocessor
204 examines the sequences of APX instructions or opcodes
(X86 opcodes) and converts or maps the instruction
sequence to a DSP macro instruction or function identifier
that is provided to the DSP core. The processor mode bit is
thus set to indicate the start of an APX code sequence which
implements a DSP function. The function preprocessor 204
examines the processor mode bit to determine if a DSP
function is being executed by the APX code sequence. If the
function preprocessor 204 determines that a DSP function is
being executed based on the processor mode bit, the function
preprocessor 204 converts or maps the instruction sequence
to a DSP macro instruction or function identifier that is
provided to the DSP core 214. The DSP core 214 executes
one or more DSP instructions to implement the desired DSP
function indicated by the DSP macro or function identifier.
The DSP core 214 preferably performs the DSP function in 
parallel with other operations performed by the general 
purpose CPU core 212 for increased system performance. 

In one embodiment, the processor mode register 213
stores the processor mode bit, and also includes one or more
bits, preferably a plurality of bits, which identify the type of
DSP function implemented by the instruction sequence.
Thus, the preprocessor 204 examines the processor mode bit
to determine if the APX code sequence implements a DSP
function. If so, the preprocessor 204 examines the plurality
of bits to determine the general type of DSP function being
implemented. The preprocessor 204 uses the information on
the general type of DSP function in creating the function
identifier, and the preprocessor 204 also examines the
instruction sequence to extract values and parameters nec- 
essary for the DSP core to implement the DSP function. 

In a second embodiment, the CPU 102 receives an
instruction sequence which comprises sequences of general
purpose, e.g., APX instructions, and which also comprises
sequences of DSP instructions. The respective processor
mode bit is set to indicate the beginning of a sequence of
DSP instructions, and the processor mode bit is cleared to
indicate the beginning of a sequence of APX instructions.
The pre-processor 204 thus routes the instructions to the
APX core or the DSP core based on the status of the
processor mode bit. In this embodiment, the pre-processor
204 is not required to map APX instructions into DSP
macros, but rather simply routes APX instructions to the
X86 core 212 and routes DSP instructions to the DSP core 214
based on the status of the processor mode bit.
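
The routing of the second embodiment can be pictured in software. The following Python fragment is only a minimal sketch: the `route` function, the two queues, and the representation of the stream as (mode bit, instruction) pairs are assumptions made for illustration and do not appear in the disclosure.

```python
from typing import Iterable, List, Tuple


def route(stream: Iterable[Tuple[int, str]]) -> Tuple[List[str], List[str]]:
    """Split a stream of (mode_bit, instruction) pairs between two cores.

    mode_bit == 1 marks an instruction fetched while the processor mode
    bit was set (a DSP instruction); mode_bit == 0 marks a general
    purpose instruction. This pairing is a hypothetical encoding.
    """
    x86_queue: List[str] = []
    dsp_queue: List[str] = []
    for mode_bit, instruction in stream:
        if mode_bit:
            dsp_queue.append(instruction)  # routed to the DSP core 214
        else:
            x86_queue.append(instruction)  # routed to the X86 core 212
    return x86_queue, dsp_queue


x86_q, dsp_q = route([(0, "mov"), (1, "mac"), (1, "mac"), (0, "add")])
```

Note that no mapping step appears in the sketch, which mirrors the statement that the pre-processor in this embodiment only routes instructions.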
FIG. 3 — Flowchart 

Referring now to FIG. 3, a flowchart diagram illustrating 
operation of the first embodiment of the present invention is 
shown. It is noted that two or more of the steps in FIG. 3 may
operate concurrently, and the operation of the invention is 
shown in flowchart form for convenience. 

As shown, in step 302 the instruction memory 202 
receives and stores a plurality of X86 instructions. The 
plurality of X86 instructions may include one or more
instruction sequences which implement a DSP function. 

In step 304 the function preprocessor 204 analyzes the 
processor mode bit. The value of the processor mode bit is 
preferably set by the program, i.e., the program which 
comprises the instruction sequences being examined. As
noted above, in the first embodiment, the processor mode bit
is set to indicate that the sequence of instructions is
designed or intended to perform a DSP-type function. The
processor mode bit is cleared to indicate that the sequence of
instructions is a regular sequence of X86 instructions that
are not intended to perform a DSP-type function. In the
present disclosure, a DSP-type function comprises one or
more of the following mathematical functions: correlation,
convolution, Fast Fourier Transform, finite impulse response
filter, infinite impulse response filter, inner product, and
matrix manipulation, among others. 

In step 306 the function preprocessor 204 determines, 
based on the status of the processor mode bit, if the sequence 
of instructions are designed or intended to perform a DSP- 
type function. 

If the processor mode bit is cleared to indicate that the 
instructions or opcodes stored in the instruction cache 202 
do not correspond to a DSP-type function, the instructions 
are provided to the X86 core 212 in step 308. Thus, these
instructions or opcodes are provided directly from the
instruction cache 202 to the X86 core 212 for execution, as
occurs in prior art X86 compatible CPUs. After the opcodes
are transferred to the X86 core 212, in step 310 the X86 core
212 executes the instructions. 

If the processor mode bit is set to indicate that the 
sequence of instructions correspond to or implement a 
DSP-type function in step 306, then in step 312 the function
preprocessor 204 analyzes the sequence of instructions and 
determines the respective DSP-type function being imple- 
mented. In step 312 the function preprocessor 204 maps the 
sequence of instructions to a respective DSP macro 
identifier, also referred to as a function identifier. The 
function preprocessor 204 also analyzes the information in 
the sequence of opcodes in step 312 and generates zero or 
more parameters for use by the DSP core or accelerator 214 
in executing the function identifier. 

As described above, in one embodiment of the invention, 
the processor mode register 213 stores a processor mode bit 
and in addition stores one or more bits, preferably a plurality 
of bits, which indicate the general type of DSP function 
being performed. Thus the application program writes a 
value into the processor mode register indicating the type of 
DSP function being implemented by the APX instruction 
sequence. In this embodiment, in step 312 the preprocessor 
204 uses the value indicating the type of DSP function to aid
in converting the sequence of instructions into a DSP 
function identifier and zero or more parameters. Thus, in this 
embodiment, the preprocessor 204 examines the processor 
mode bit in step 304 to determine if the APX code sequence 
implements a DSP function. If so, in step 312 the prepro- 
cessor 204 examines the plurality of bits to determine the 
general type of DSP function being implemented. The 
preprocessor 204 then examines the instruction sequence in 
step 312 to extract values and parameters necessary for the 
DSP core to implement the DSP function. 

As shown, after the preprocessor 204 has generated the 
function identifier and the parameters in step 312, in step
314 the function preprocessor 204 provides the function 
identifier and the parameters to the DSP core 214. 

The DSP core 214 receives the function identifier and the 
associated parameters from the function preprocessor 204 
and in step 316 performs the respective DSP function. In the 
preferred embodiment, the DSP core 214 uses the function 
identifier to index into a DSP microcode RAM or ROM to 
execute a sequence of DSP instructions or opcodes. The DSP 
instructions cause the DSP to perform the desired DSP-type 
function. The DSP core 214 also uses the respective param- 
eters in executing the DSP function. 
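
Steps 304 through 316 can be summarized with a small software model. The Python sketch below is illustrative only: `ModeRegister`, `INNER_PRODUCT`, `MICROCODE_ROM`, `preprocess`, and `dsp_core_execute` are hypothetical names, and a dictionary of Python functions stands in for the DSP microcode RAM or ROM that the function identifier indexes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ModeRegister:
    mode_bit: int = 0       # 1 = the sequence implements a DSP-type function
    function_type: int = 0  # extra bits naming the general type of function


# Hypothetical function-type code and microcode table (illustration only).
INNER_PRODUCT = 1
MICROCODE_ROM: Dict[int, Callable[..., float]] = {
    INNER_PRODUCT: lambda xs, ys: sum(x * y for x, y in zip(xs, ys)),
}


def preprocess(reg: ModeRegister, params: Tuple) -> Optional[Tuple[int, Tuple]]:
    """Steps 304-312: build (function identifier, parameters).

    Returns None when the mode bit is cleared, in which case the
    sequence would go to the X86 core unchanged (step 308).
    """
    if not reg.mode_bit:
        return None
    return reg.function_type, params


def dsp_core_execute(function_id: int, params: Tuple) -> float:
    """Step 316: index the microcode table with the function identifier."""
    return MICROCODE_ROM[function_id](*params)


macro = preprocess(ModeRegister(1, INNER_PRODUCT), ([1.0, 2.0], [3.0, 4.0]))
result = dsp_core_execute(*macro) if macro else None
```

In this model the parameters are passed in directly; in the described hardware the preprocessor extracts them from the instruction sequence itself.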

As mentioned above, the X86 core 212 and DSP core 214 
are coupled together and provide data and timing signals
between each other. In the preferred embodiment, the X86 
core 212 and DSP core 214 operate substantially in parallel. 
Thus, while the X86 core 212 is executing one sequence of 
opcodes, the DSP accelerator 214 may be executing one or
more DSP functions corresponding to another sequence of
opcodes. Thus, the DSP core 214 does not operate as a slave
or co-processor, but rather operates as an independent
execution unit or pipeline. The DSP core 214 and the X86
core 212 provide data and timing signals to each other to
indicate the status of operations and also to provide any data 
outputs produced, as well as to ensure data coherency/ 
independence. 
Example Operation 

The following describes an example of how a string or
sequence of X86 opcodes is converted into a function
identifier and then executed by the DSP core or accelerator 
214 according to the present invention. The following 
describes an X86 opcode sequence which performs a simple 
inner product computation, wherein the inner product is 
averaged over a vector comprising 20 values:

X86 Code
(Simple inner product)

Cycles   Instruction                     Comment
1        Mov ECX, num_samples;           {Set up parameters for macro}
1        Mov ESI, address_1;
1        Mov EDI, address_2;
1        Mov EAX, 0;                     {Initialize vector indices}
1        Mov EBX, 0;
4        FLdZ;                           {Initialize sum of products}
         Again:                          {Update counter}
4        Fld dword ptr [ESI+EAX*4];      {Get vector elements and}
1        Inc EAX;                        {update indices}
4        Fld dword ptr [EDI+EBX*4];
1        Inc EBX;
13       FMulP St(1), St;                {Compute product term}
7        FAddP St(1), St;                {Add term to sum}
5        LOOP Again;                     {Continue if more terms}

As shown, the X86 opcode instructions for a simple inner
product comprise a plurality of move instructions followed
by an F-load function wherein this sequence is repeated a
plurality of times. If this X86 opcode sequence were
executed by the X86 core 212, the execution time for this
inner product computation would require 709 cycles
(9+20x35). This assumes i486 timing, concurrent execution of
floating point operations, and cache hits for all instructions
and data required for the inner product computation. The
function preprocessor 204 analyzes the sequence of opcodes
and detects that the opcodes are performing an inner product
computation. The function preprocessor 204 then converts
this entire sequence of X86 opcodes into a single macro or
function identifier and one or more parameters. An example
macro or function identifier that is created based on the X86
opcode sequence shown above would be as follows:

Example Macro
(as it appears in assembler)

Inner_product_simple(
         address_1,                      {Data vector}
         address_2,                      {Data vector}
         num_samples);                   {Length of vector}

This function identifier and one or more parameters are
provided to the DSP core 214. The DSP core 214 uses the
macro provided from the function preprocessor 204 to load
one or more DSP opcodes or instructions which execute the
DSP function. In the preferred embodiment, the DSP core
214 uses the macro to index into a ROM which contains the
instructions used for executing the DSP function. In this
example, the DSP code or instructions executed by the DSP
core 214 in response to receiving the macro described above
are shown below:

DSP Code
(Simple inner product)

Cycles   Instruction                     Comment
1        Cntr = num_samples;             {Set up parameters from macro}
1        ptr1 = address_1;
1        ptr2 = address_2;
1        MAC = 0;                        {Initialize sum of products}
1        reg1 = *ptr1++,                 {Pre-load multiplier input registers}
         reg2 = *ptr2++;
1        Do LOOP until ce;               {Specify loop parameters}
1        MAC += reg1 * reg2,             {Form sum of products}
         reg1 = *ptr1++,
         reg2 = *ptr2++;
         LOOP:                           {Continue if more terms}



In this example, the DSP core 214 performs this inner
product averaged over a vector comprising 20 values and
consumes a total of 26 cycles (6+20x1). This assumes
typical DSP timing, including a single cycle operation of
instructions, zero overhead looping and cache hits for all
instructions and data. Thus, the DSP core 214 provides a
performance increase of over 27 times that where the X86
core 212 executes this DSP function.
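
The cycle counts quoted in this example can be checked with a few lines of arithmetic. The Python sketch below is an illustration only; the helper names `total_cycles` and `inner_product` are hypothetical, while the setup and per-iteration counts are the ones stated in the text (9 and 35 for the X86 sequence, 6 and 1 for the DSP sequence).

```python
def total_cycles(setup: int, per_iteration: int, iterations: int) -> int:
    """Setup cost plus a fixed cost for each pass through the loop."""
    return setup + per_iteration * iterations


def inner_product(xs, ys):
    """The computation that both code sequences perform."""
    return sum(x * y for x, y in zip(xs, ys))


N = 20  # vector length used in the example
x86_cycles = total_cycles(setup=9, per_iteration=35, iterations=N)
dsp_cycles = total_cycles(setup=6, per_iteration=1, iterations=N)
speedup = x86_cycles / dsp_cycles

print(x86_cycles, dsp_cycles, round(speedup, 1))  # 709 26 27.3
```

With these figures the ratio of the two execution times is 709/26, or roughly 27.3.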
FIG. 4 — CPU Block Diagram 

Referring now to FIG. 4, a more detailed block diagram
is shown illustrating the internal components of the CPU
102 according to the present invention. Elements in the CPU
102 that are not necessary for an understanding of the
present invention are not described for simplicity. As shown,
in the preferred embodiment the CPU 102 includes a bus
interface unit 440, instruction cache 202, a data cache 444,
an instruction decode unit 402, a plurality of execute units
448, a load/store unit 450, a reorder buffer 452, a register file
454, and a DSP unit 214.

As shown, the CPU 102 includes a bus interface unit 440
which includes circuitry for performing communication
upon CPU bus 104. The bus interface unit 440 interfaces to
the data cache 444 and the instruction cache 202. The
instruction cache 202 prefetches instructions from the system
memory 110 and stores the instructions for use by the
CPU 102. The instruction decode unit 402 is coupled to the
instruction cache 202 and receives instructions from the
instruction cache 202. The instruction decode unit 402
includes function preprocessor 204 and processor mode
register or bit 213, as shown. The function preprocessor 204
in the instruction decode unit 402 is coupled to the instruction
cache 202. The instruction decode unit 402 further
includes an instruction alignment unit as well as other logic.

The instruction decode unit 402 couples to a plurality of
execution units 448, reorder buffer 452, and load/store unit
450. The plurality of execute units are collectively referred
to herein as execute units 448. Reorder buffer 452, execute
units 448, and load/store unit 450 are each coupled to a
forwarding bus 458 for forwarding of execution results.
Load/store unit 450 is coupled to data cache 444. DSP unit
214 is coupled directly to the instruction decode unit 402
through the DSP dispatch bus 456. It is noted that one or
more DSP units 214 may be coupled to the instruction
decode unit 402.

Bus interface unit 440 is configured to effect communication
between microprocessor 102 and devices coupled to
system bus 104. For example, instruction fetches which miss
instruction cache 202 are transferred from main memory 110
by bus interface unit 440. Similarly, data requests performed
by load/store unit 450 which miss data cache 444 are
transferred from main memory 110 by bus interface unit
440. Additionally, data cache 444 may discard a cache line
of data which has been modified by microprocessor 102. Bus
interface unit 440 transfers the modified line to main
memory 110.

Instruction cache 202 is preferably a high speed cache
memory for storing instructions. It is noted that instruction
cache 202 may be configured into a set-associative or direct
mapped configuration. Instruction cache 202 may additionally
include a branch prediction mechanism for predicting
branch instructions as either taken or not taken. A "taken"
branch instruction causes instruction fetch and execution to
continue at the target address of the branch instruction. A
"not taken" branch instruction causes instruction fetch and
execution to continue at the instruction subsequent to the
branch instruction. Instructions are fetched from instruction
cache 202 and conveyed to instruction decode unit 402 for
decode and dispatch to an execution unit. The instruction
cache 202 may also include a macro prediction mechanism
for predicting macro instructions and taking the appropriate
action.

Instruction decode unit 402 decodes instructions received
from the instruction cache 202 and provides the decoded
instructions to the execute units 448, the load/store unit 450,
or the DSP unit 214. The instruction decode unit 402 is
preferably configured to dispatch an instruction to more than
one execute unit 448.

The instruction decode unit 402 includes function preprocessor
204. According to the first embodiment of the
present invention, the function preprocessor 204 in the
instruction decode unit 402 is configured to examine the
status of the processor mode bit 213 to determine whether an
X86 instruction sequence in the instruction cache 202
corresponds to or performs DSP functions. If the processor
mode bit 213 is set to indicate such an instruction sequence,
the function preprocessor 204 generates a corresponding
macro and parameters and transmits the corresponding DSP
macro and parameters to the DSP unit 214 upon DSP
dispatch bus 456. The DSP unit 214 receives the DSP
function macro and parameter information from the instruction
decode unit 402 and performs the indicated DSP function.
Additionally, DSP unit 214 is preferably configured to
access data cache 444 for data operands. Data operands may
be stored in a memory within DSP unit 214 for quicker
access, or may be accessed directly from data cache 444
when needed. Function preprocessor 204 provides feedback
to instruction cache 202 to ensure that sufficient look ahead
instructions are available for macro searching.

If the processor mode bit 213 indicates that the X86
instructions in the instruction cache 202 are not intended to
perform a DSP function, the instruction decode unit 402
decodes the instructions fetched from instruction cache 202
and dispatches the instructions to execute units 448 and/or
load/store unit 450. Instruction decode unit 402 also detects
the register operands used by the instruction and requests
these operands from reorder buffer 452 and register file 454.
Execute units 448 execute the X86 instructions as is known
in the art.

Also, if the DSP 214 is not included in the CPU 102 or is
disabled through software, instruction decode unit 402
dispatches all X86 instructions to execute units 448. Execute
units 448 execute the X86 instructions as in the prior art. In
this manner, if the DSP unit 214 is disabled, the X86 code,
including the instructions which perform DSP functions, is
executed by the X86 core, as is currently done in prior art
X86 microprocessors. Thus, if the DSP unit 214 is disabled,
the program executes correctly even though operation is less
efficient than the execution of a corresponding routine in the
DSP 214. Advantageously, the enabling or disabling, or the
presence or absence, of the DSP core 214 in the CPU 102
does not affect the correct operation of the program.

In one embodiment, execute units 448 are symmetrical 
execution units that are each configured to execute the 
instruction set employed by microprocessor 102. In another 
embodiment, execute units 448 are asymmetrical execution 
units configured to execute dissimilar instruction subsets. 
For example, execute units 448 may include a branch 
execute unit for executing branch instructions, one or more 
arithmetic/logic units for executing arithmetic and logical 
instructions, and one or more floating point units for execut- 
ing floating point instructions. Instruction decode unit 402 
dispatches an instruction to an execute unit 448 or load/store 
unit 450 which is configured to execute that instruction. 

Load/store unit 450 provides an interface between execute 
units 448 and data cache 444. Load and store memory 
operations are performed by load/store unit 450 to data 
cache 444. Additionally, memory dependencies between 
load and store memory operations are detected and handled
by load/store unit 450. 

Execute units 448 and load/store unit(s) 450 may include
one or more reservation stations for storing instructions 
whose operands have not yet been provided. An instruction 
is selected from those stored in the reservation stations for 
execution if: (1) the operands of the instruction have been 
provided, and (2) the instructions which are prior to the 
instruction being selected have not yet received operands. It 
is noted that a centralized reservation station may be 
included instead of separate reservation stations. The cen-
tralized reservation station is coupled between instruction 
decode unit 402, execute units 448, and load/store unit 450. 
Such an embodiment may perform the dispatch function 
within the centralized reservation station. 

CPU 102 preferably supports out of order execution and
employs reorder buffer 452 for storing execution results of 
speculatively executed instructions and storing these results 
into register file 454 in program order, for performing 
dependency checking and register renaming, and for pro- 
viding for mispredicted branch and exception recovery. 
When an instruction is decoded by instruction decode unit 
402, requests for register operands are conveyed to reorder 
buffer 452 and register file 454. In response to the register 
operand requests, one of three values is transferred to the 
execute unit 448 and/or load/store unit 450 which receives 
the instruction: (1) the value stored in reorder buffer 452, if
the value has been speculatively generated; (2) a tag iden- 
tifying a location within reorder buffer 452 which will store 
the result, if the value has not been speculatively generated; 
or (3) the value stored in the register within register file 454, 
if no instructions within reorder buffer 452 modify the
register. Additionally, a storage location within reorder 
buffer 452 is allocated for storing the results of the instruc- 
tion being decoded by instruction decode unit 402. The 
storage location is identified by a tag, which is conveyed to 
the unit receiving the instruction. It is noted that, if more
than one reorder buffer storage location is allocated for 
storing results corresponding to a particular register, the 
value or tag corresponding to the last result in program order 
is conveyed in response to a register operand request for that 
particular register. 
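
The three possible responses to a register operand request can be modeled in a few lines. The Python sketch below is a simplified illustration and not the disclosed circuit: `RobEntry` and `read_operand` are hypothetical names, the reorder buffer is modeled as a list kept in program order, and a value of `None` marks a result that has not yet been generated.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RobEntry:
    tag: int                     # identifies the storage location
    dest: str                    # register the instruction will write
    value: Optional[int] = None  # None until the result has been produced


def read_operand(reg: str, rob: List[RobEntry],
                 regfile: Dict[str, int]) -> Tuple[str, int]:
    """Answer a register operand request with a value or a tag.

    `rob` is ordered oldest first, so scanning it in reverse finds the
    last result in program order for the requested register.
    """
    for entry in reversed(rob):
        if entry.dest == reg:
            if entry.value is not None:
                return ("value", entry.value)  # case (1): speculative value
            return ("tag", entry.tag)          # case (2): result pending
    return ("value", regfile[reg])             # case (3): register file


regfile = {"EAX": 5, "EBX": 7}
rob = [RobEntry(0, "EAX", 42), RobEntry(1, "EAX")]
pending = read_operand("EAX", rob, regfile)
committed = read_operand("EBX", rob, regfile)
```

Here two reorder buffer locations target EAX, so the request for EAX returns the tag of the later one, as the last sentence of the paragraph describes.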

When execute units 448 or load/store unit 450 execute an 
instruction, the tag assigned to the instruction by reorder 
buffer 452 is conveyed upon result bus 458 along with the 
result of the instruction. Reorder buffer 452 stores the result
in the indicated storage location. Additionally, execute units
448 and load/store unit 450 compare the tags conveyed upon
result bus 458 with tags of operands for instructions stored
therein. If a match occurs, the unit captures the result from
result bus 458 and stores it with the corresponding instruction.
In this manner, an instruction may receive the operands
it is intended to operate upon. Capturing results from result
bus 458 for use by instructions is referred to as "result
forwarding".
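
Result forwarding amounts to a tag comparison followed by a capture. The Python sketch below illustrates the idea under simplifying assumptions: `WaitingInstruction` and `forward` are hypothetical names, and each operand is modeled as either a `("value", v)` or a `("tag", t)` pair.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WaitingInstruction:
    opcode: str
    operands: List[Tuple[str, int]]  # ("value", v) or ("tag", t) pairs

    def ready(self) -> bool:
        """An instruction can issue once every operand is a value."""
        return all(kind == "value" for kind, _ in self.operands)


def forward(result_tag: int, result: int,
            waiting: List[WaitingInstruction]) -> None:
    """Broadcast (tag, result) and capture it wherever a tag matches."""
    for instr in waiting:
        instr.operands = [
            ("value", result) if op == ("tag", result_tag) else op
            for op in instr.operands
        ]


add = WaitingInstruction("add", [("value", 3), ("tag", 7)])
was_ready = add.ready()
forward(7, 10, [add])
```

After the broadcast of tag 7, the waiting instruction holds two values and becomes eligible for execution.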

Instruction results are stored into register file 454 by
reorder buffer 452 in program order. Storing the results of an
instruction and deleting the instruction from reorder buffer
452 is referred to as "retiring" the instruction. By retiring the
instructions in program order, recovery from incorrect
speculative execution may be performed. For example, if an
instruction is subsequent to a branch instruction whose
taken/not taken prediction is incorrect, then the instruction
may be executed incorrectly. When a mispredicted branch
instruction or an instruction which causes an exception is
detected, reorder buffer 452 discards the instructions subsequent
to the mispredicted branch instructions. Instructions
thus discarded are also flushed from execute units 448,
load/store unit 450, and instruction decode unit 402.

Register file 454 includes storage locations for each 
register defined by the microprocessor architecture 
employed by microprocessor 102. For example, in the 
preferred embodiment where the CPU 102 includes an X86
microprocessor architecture, the register file 454 includes
locations for storing the EAX, EBX, ECX, EDX, ESI, EDI,
ESP, and EBP register values. 

Data cache 444 is a high speed cache memory configured 
to store data to be operated upon by microprocessor 102. It 
is noted that data cache 444 may be configured into a 
set-associative or direct-mapped configuration. 

For more information regarding the design and operation 
of an X86 compatible microprocessor, please see co-pending
patent application entitled "High Performance Superscalar
Microprocessor", Ser. No. 08/146,382, filed Oct. 29, 1993
by Witt, et al., now U.S. Pat. No. 5,651,125, and co-pending
patent application entitled "Superscalar Microprocessor
Including a High Performance Instruction Alignment Unit",
Ser. No. 08/377,843, filed Jan. 25, 1995 by Witt, et al., now
U.S. Pat. No. 5,819,057, which are both assigned to the
assignee of the present application, and which are both 
hereby incorporated by reference in their entirety as though 
fully and completely set forth herein. Please also see "Super- 
scalar Microprocessor Design" by Mike Johnson, Prentice- 
Hall, Englewood Cliffs, N.J., 1991, which is hereby incor- 
porated herein by reference in its entirety. 
FIG. 5 — Instruction Decode Unit 

Referring now to FIG. 5, one embodiment of instruction 
decode unit 402 is shown. Instruction decode unit 402 
includes an instruction alignment unit 460, a plurality of 
decoder circuits 462, processor mode register or bit 213, and
a DSP function preprocessor 204. Instruction alignment unit 
460 is coupled to receive instructions fetched from instruc- 
tion cache 202 and aligns instructions to decoder circuits 
462. 

Instruction alignment unit 460 routes instructions to
decoder circuits 462. In one embodiment, instruction alignment
unit 460 includes a byte queue in which instruction
bytes fetched from instruction cache 202 are queued.
Instruction alignment unit 460 locates valid instructions
from within the byte queue and dispatches the instructions to 
respective decoder circuits 462. In another embodiment, 
instruction cache 202 includes predecode circuitry which 
predecodes instruction bytes as they are stored into instruction
cache 202. Start and end byte information indicative of
the beginning and end of instructions is generated and stored
within instruction cache 202. The predecode data is transferred
to instruction alignment unit 460 along with the
instructions, and instruction alignment unit 460 transfers
instructions to the decoder circuits 462 according to the
predecode information.

The function preprocessor 204 is also coupled to the
instruction cache 202. As described above, the function
preprocessor 204 examines the processor mode bit in order
to detect instruction sequences in the instruction cache 202
which perform DSP instructions. Decoder circuits 462 and
function preprocessor 204 receive X86 instructions from the
instruction alignment unit 460. The function preprocessor
204 provides an instruction disable signal upon a DSP bus
to each of the decoder units 462.

Each decoder circuit 462 decodes the instruction received
from instruction alignment unit 460 to determine the register
operands manipulated by the instruction as well as the unit
to receive the instruction. An indication of the unit to receive
the instruction as well as the instruction itself are conveyed
upon a plurality of dispatch buses 468 to execute units 448
and load/store unit 450. Other buses, not shown, are used to
request register operands from reorder buffer 452 and register
file 454.

The function preprocessor 204 examines the processor
mode bit to determine if streams or sequences of X86
instructions from the instruction cache 202 implement a DSP
function. If so, the function preprocessor 204 maps the X86
instruction stream to a DSP macro and zero or more parameters
and provides this information to one of the one or more
DSP units 214. In one embodiment, when the respective
instruction sequence reaches the decoder circuits 462, the
function preprocessor 204 asserts a disable signal to each of
the decoders 462 to disable operation of the decoders 462 for
the detected instruction sequence. When a decoder circuit
462 detects the disable signal from function preprocessor
204, the decoder circuit 462 discontinues decoding operations
until the disable signal is released. After the instruction
sequence corresponding to the DSP function has exited the
instruction cache 202, the processor mode bit is cleared, and
the function preprocessor 204 removes the disable signal to
each of the decoders 462. In other words, once the processor
mode bit is cleared and the function preprocessor 204
detects the end of the X86 instruction sequence, the function
preprocessor 204 removes the disable signal to each of the
decoders 462, and the decoders resume operation.

Each of decoder circuits 462 is configured to convey an
instruction upon one of dispatch buses 468, along with an
indication of the unit or units to receive the instruction. In
one embodiment, a bit is included within the indication for
each of execute units 448 and load/store unit 450. If a
particular bit is set, the corresponding unit is to execute the
instruction. If a particular instruction is to be executed by
more than one unit, more than one bit in the indication may
be set.
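
The per-unit indication can be read as a simple bit mask. The Python sketch below is an illustration only: the unit names in `UNITS`, their bit positions, and the helpers `make_indication` and `units_selected` are assumptions, since the text does not specify the number of execute units or the bit assignment.

```python
# Hypothetical unit list; the bit position of each unit is its list index.
UNITS = ["execute0", "execute1", "execute2", "load_store"]


def make_indication(targets):
    """Set one bit for every unit that is to execute the instruction."""
    mask = 0
    for name in targets:
        mask |= 1 << UNITS.index(name)
    return mask


def units_selected(mask):
    """Decode an indication back into the list of selected units."""
    return [name for i, name in enumerate(UNITS) if mask & (1 << i)]


mask = make_indication(["execute1", "load_store"])
selected = units_selected(mask)
```

An instruction destined for two units simply carries an indication with two bits set.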

Function Preprocessor 

As shown in FIG. 5, in the first embodiment the function
preprocessor 204 comprises a conversion/mapping circuit
506 for converting a sequence of instructions in the instruction
memory 202 which implements a digital signal processing
function into a digital signal processing function
identifier or macro identifier and zero or more parameters.
Thus if the processor mode bit indicates that the sequence of
instructions in the instruction memory 202 implements a
DSP function, the conversion/mapping circuit 506 converts
this sequence of instructions into a DSP function identifier
and zero or more parameters. For example, if the instruction
sequence determination circuit 504 examines and determines
that the sequence of instructions in the instruction
memory 202 implements an FFT function, the conversion/
mapping circuit 506 converts this sequence of instructions
into an FFT function identifier and zero or more parameters.

As discussed above with respect to step 312 of FIG. 3, in 
one embodiment of the invention the processor mode reg-
ister 213 stores a processor mode bit and in addition stores
one or more bits, preferably a plurality of bits, which
indicate the general type of DSP function being performed.
Thus the application program writes a value into the pro-
cessor mode register 213 indicating the type of DSP function
being implemented by the APX instruction sequence. The
conversion/mapping circuit 506 uses the value indicating the 
type of DSP function to aid in converting the sequence of 
instructions into a DSP function identifier and zero or more 
parameters. 
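
A register that holds both a mode bit and a function-type field, as described above, might be encoded as in the Python sketch below. The field widths, bit positions, and type codes are assumptions chosen for illustration; the specification does not give a layout.

```python
# Hypothetical layout of processor mode register 213:
#   bit 0     -> processor mode bit (1 = sequence implements a DSP function)
#   bits 1-3  -> general type of DSP function (codes below are invented)
MODE_BIT = 0x1
TYPE_SHIFT = 1
TYPE_MASK = 0x7
DSP_TYPES = {0: "fft", 1: "inner_product", 2: "convolution", 3: "correlation"}


def write_mode_register(is_dsp, type_code=0):
    """Build the value an application program would write to the register."""
    if not is_dsp:
        return 0
    return MODE_BIT | ((type_code & TYPE_MASK) << TYPE_SHIFT)


def read_mode_register(value):
    """Return (is_dsp, dsp_type); dsp_type is None for non-DSP sequences."""
    if not value & MODE_BIT:
        return (False, None)
    code = (value >> TYPE_SHIFT) & TYPE_MASK
    return (True, DSP_TYPES.get(code))
```

In this sketch the type field is only meaningful while the mode bit is set, which mirrors the way the conversion/mapping circuit uses the type value as an aid rather than as a stand-alone signal.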

FIG. 6 — Pattern Recognition Circuit 

Referring now to FIG. 6, in one embodiment the function 
preprocessor 204 includes a pattern recognition circuit or
pattern recognition detector 512 which determines whether
a sequence of instructions in the instruction memory 202
implements a digital signal processing function. The pattern
recognition circuit 512 is used to convert the sequence of
instructions into a DSP function identifier and zero or more 
parameters. 

The pattern recognition circuit 512 stores a plurality of
patterns of instruction sequences which implement digital 
signal processing functions. The pattern recognition circuit 
512 stores bit patterns which correspond to opcode 
sequences of machine language instructions which perform 
DSP functions, such as FFTs, inner products, matrix 
manipulation, correlation, convolution, etc. 

For instruction sequences where the processor mode bit is 
set to indicate that the sequence implements a DSP function, 
the pattern recognition detector 512 compares each of the
patterns with the respective instruction sequence. The pat- 
tern recognition detector 512 examines the sequence of 
instructions stored in the instruction memory 202 and com- 
pares the sequence of instructions with the plurality of stored 
patterns. Operation of the pattern recognition detector 512 is 
shown in FIG. 7. The pattern recognition detector 512 may 
include a look-up table as the unit which performs the 
pattern comparisons, as desired. The pattern recognition 
detector 512 may also perform macro prediction on instruc-
tion sequences to improve performance.

The pattern recognition detector 512 determines whether 
the sequence of instructions in the instruction memory 202 
substantially matches one of the plurality of stored patterns. 
A substantial match indicates that the sequence of instruc- 
tions implements the respective digital signal processing 
function. In the preferred embodiment, a substantial match 
occurs where the instruction sequence matches a stored 
pattern by greater than 90%. Other matching thresholds, 
such as 95% or 100%, may be used, as desired. The pattern
recognition detector 512 determines the type of DSP func- 
tion pattern which matched the sequence of instructions and 
passes this DSP function type to the conversion/mapping 
circuit 506. 
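
The "substantial match" test can be sketched in Python as follows. The specification states only the threshold (greater than 90% in the preferred embodiment); the position-by-position comparison used here, the pattern names, and the function names are assumptions made for illustration.

```python
def match_fraction(sequence, pattern):
    """Fraction of positions at which the opcodes agree (assumed metric)."""
    if not pattern or len(sequence) != len(pattern):
        return 0.0
    hits = sum(1 for s, p in zip(sequence, pattern) if s == p)
    return hits / len(pattern)


def recognize(sequence, stored_patterns, threshold=0.90):
    """Return the DSP function type of the best-matching stored pattern
    whose match fraction is greater than the threshold, or None."""
    best_type, best_score = None, threshold
    for dsp_type, pattern in stored_patterns.items():
        score = match_fraction(sequence, pattern)
        if score > best_score:
            best_type, best_score = dsp_type, score
    return best_type
```

With 20-opcode patterns, a sequence that differs in one position (95%) is recognized, while one that differs in two positions (exactly 90%) is not, because the test is strictly "greater than". Raising `threshold` to 0.95 or requiring a fraction of 1.0 corresponds to the other thresholds mentioned above.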

FIG. 8 — Look-up Table

Referring now to FIG. 8, in another embodiment the 
conversion/mapping circuit 506 includes a look-up table 
(LUT) 514 which determines the digital signal processing 
function that corresponds to a sequence of instructions in the 
instruction memory 202. In this embodiment, the look-up 
table 514 may be in addition to, or instead of, the pattern 
recognition detector 512. Thus the LUT 514 is used in
converting the sequence of instructions into a DSP function 
identifier and zero or more parameters. The LUT operates as 
shown in FIG. 9. 

In an embodiment where the function preprocessor 204
includes only the look-up table 514, the look-up table 514
stores a plurality of patterns wherein each of the patterns is
at least a subset of an instruction sequence which imple-
ments a digital signal processing function. Thus, this
embodiment is similar to the embodiment of FIG. 6
described above, except that the function preprocessor 204
includes the look-up table 514 instead of the pattern recog-
nition detector 512 for determining which DSP function
corresponds to an instruction sequence. In this embodiment,
the look-up table 514 requires an exact match with a
corresponding sequence of instructions. If an exact match
does not occur, then the sequence of instructions is passed
to the one or more general purpose execution units, i.e., the
general purpose CPU core, for execution.

FIG. 9 illustrates operation of the look-up table 514 in this
embodiment. As shown, a sequence of instructions in the
instruction cache 202 is temporarily stored in the instruc-
tion latch 542. If the processor mode bit indicates that the
instruction sequence implements a DSP function, then the
contents of the instruction latch 542 are compared with
each of the entries in the look-up table 514 by element 546.
If the contents of the instruction latch 542 exactly match one
of the entries in the look-up table 514, then the DSP function
or instruction 548 which corresponds to this entry is pro-
vided to the DSP execution unit 214.
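
The exact-match look-up and its fall-back path can be sketched in Python as below. This is a minimal illustration under stated assumptions: the table is modeled as a dictionary keyed by the latched opcode sequence, and the opcode mnemonics, the identifier `"INNER_PRODUCT"`, and the function name `dispatch` are hypothetical.

```python
def dispatch(instruction_latch, mode_bit, look_up_table):
    """Return (target, payload) for the latched instruction sequence.

    look_up_table maps a tuple of opcodes to a DSP function identifier.
    """
    if mode_bit:
        dsp_function = look_up_table.get(tuple(instruction_latch))
        if dsp_function is not None:
            # Exact match: hand the DSP function to the DSP execution unit.
            return ("dsp_unit", dsp_function)
    # Mode bit clear, or no exact match: the general purpose core
    # executes the original instructions unchanged.
    return ("general_purpose_core", list(instruction_latch))
```

Because anything short of an exact match falls through to the general purpose core, a missed sequence costs only the DSP speed-up and never changes program results.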

In the above embodiments of FIGS. 6 and 8, the pattern
recognition detector 512 and/or the look-up table 514 are
configured to determine the DSP function which corre-
sponds to an instruction sequence only when the determi-
nation can be made with relative certainty. This is because
a "missed" instruction sequence, i.e., an instruction
sequence which implements a DSP function, wherein the
type of DSP instruction could not be positively identified,
will not affect operation of the CPU 102, since the general
purpose core or execution units can execute the instruction
sequence. However, an instruction sequence which
implements a DSP function that is mis-identified, i.e., the
wrong DSP function is determined to be implemented, is
more problematic, and could result in possible erroneous
operation. Thus it is anticipated that the pattern recognition
detector 512 or the look-up table 514 may not accurately
detect every instruction sequence which implements a DSP
function. In this instance, even though the processor mode
bit was set to indicate that the instruction sequence imple-
mented a DSP function, the instruction sequence is prefer-
ably passed on to one of the general purpose execution units,
as occurs in the prior art.
FIG. 10 — Second Embodiment 

FIG. 10 is a high level block diagram of the CPU 102 
according to the second embodiment of the invention. Thus,
FIG. 10 is similar to FIG. 2, but illustrates the second
embodiment described above. 

As shown, the CPU 102 includes an instruction cache or
instruction memory 202 which receives instructions or
opcodes from the system memory 110. In this second
embodiment, the instructions comprise sequences of x86 or
APX instructions and sequences of DSP instructions. Thus,
unlike the first embodiment of FIG. 2 wherein all received
instructions were APX instructions, in this second embodi-
ment the received instructions comprise APX instruction
sequences and DSP instruction sequences.

Preprocessor 204A is coupled to the instruction memory
202 and examines instruction sequences or opcode
sequences in the instruction memory 202. The preprocessor
204A is also coupled to the X86 core 212 and the DSP core
214. The function preprocessor 204A is further coupled to
the processor mode register 213 storing the processor mode
bit. As shown, the preprocessor 204A examines the proces-
sor mode bit and selectively provides APX instructions or
opcodes to the X86 core 212 or selectively provides DSP
opcodes or instructions to the DSP core 214.

The X86 core 212 and DSP core 214 are coupled together
and provide data and timing signals between each other. In
one embodiment, the CPU 102 includes one or more buffers
(not shown) which interface between the X86 core 212 and
the DSP core 214 to facilitate transmission of data between
the X86 core 212 and the DSP core 214.

In this second embodiment, the CPU 102 receives instruc-
tions which comprise sequences of general purpose, e.g.,
APX instructions, and which also comprise sequences of
DSP instructions. The respective processor mode bit is set to
indicate the beginning of a sequence of DSP instructions,
and the processor mode bit is cleared to indicate the begin-
ning of a sequence of APX instructions. The preprocessor
204A thus routes the instructions to the APX core or the DSP
core based on the status of the processor mode bit. In this
embodiment, the preprocessor 204A is not required to map
APX instructions into DSP macros, but rather simply routes
APX instructions to the x86 core 212 and routes DSP
instructions to the DSP core 214 based on the status of the
processor mode bit.

FIG. 11 — Flowchart Diagram: Second Embodiment 

FIG. 11 is a flowchart diagram illustrating the second
embodiment. As described above, in this second embodi-
ment the CPU 102 receives an instruction sequence which
comprises sequences of general purpose, e.g., APX
instructions, and which also comprises sequences of DSP
instructions. The respective processor mode bit is set to
indicate the beginning of a sequence of DSP instructions,
and the processor mode bit is cleared to indicate the begin-
ning of a sequence of APX instructions. The CPU 102 thus
routes the instructions to the APX core or the DSP core
based on the status of the processor mode bit.

As shown, in step 802 the CPU 102 receives sequences of
instructions. As noted above, these instructions comprise
sequences of general purpose, e.g., APX instructions, and
also comprise sequences of DSP instructions. In step 804 the
preprocessor 204A examines the processor mode bit to deter-
mine if a respective sequence is a sequence of APX instruc-
tions or a sequence of DSP instructions.

In step 806 the preprocessor 204A determines, based on
the status of the processor mode bit, if the respective
sequence is a sequence of APX instructions or a sequence of
DSP instructions. If the processor mode bit is cleared to
indicate that the instructions or opcodes stored in the instruc-
tion cache 202 are not DSP instructions, the instructions are
provided to the X86 core 212 in step 808. Thus, these
instructions or opcodes are provided directly from the
instruction cache 202 to the X86 core 212 for execution, as
occurs in prior art X86 compatible CPUs. After the opcodes
are transferred to the X86 core 212, in step 810 the X86 core
212 executes the instructions.

If the processor mode bit is set to indicate that the
sequence of instructions comprises DSP instructions in step
806, then in step 812 the preprocessor 204A provides the
DSP instruction sequence to the DSP core 214. In step 814
the DSP core 214 executes the DSP instructions.
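
The routing flow of steps 802 through 814 can be summarized in a short Python sketch. It is illustrative only: each incoming sequence is modeled as a pair of the mode bit and its instructions, and the function name `route` and the core labels are assumptions.

```python
def route(sequences):
    """Route each (mode_bit, instructions) pair to a core.

    Returns a list of (core, instructions) in arrival order.
    """
    routed = []
    for mode_bit, instructions in sequences:          # step 802: receive
        if mode_bit:                                  # steps 804/806: examine bit
            routed.append(("dsp_core", instructions))   # steps 812/814
        else:
            routed.append(("x86_core", instructions))   # steps 808/810
    return routed
```

Unlike the first embodiment, no conversion or pattern matching appears anywhere in this path; the mode bit alone decides the destination.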
FIG. 12 — Processor Mode Register

FIG. 12 illustrates one embodiment of the processor mode 
register 213. As shown, in one embodiment, a special 
register in the APX CPU includes one or more bits, referred
to as processor mode bits, assigned to indicate the processor 
mode, i.e., which indicate whether an instruction sequence 
comprises DSP instructions or implements a DSP function, 
or whether the instruction sequence is a regular APX instruc- 
tion sequence. 

FIG. 13 — Instruction Sequence 

FIG. 13 illustrates one embodiment of an instruction
sequence which includes a DSP instruction sequence. As
shown, after a number of APX instructions, e.g., three
instructions, a DSP routine is called. The DSP routine sets
the DSP bit to indicate the start of a sequence of DSP
instructions. After the DSP instructions or operations are
executed by the DSP core 214, the routine clears the DSP bit
and returns to execution of APX instructions.
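
An instruction stream of the kind shown in FIG. 13 can be walked through with the Python sketch below. The marker names `SET_DSP_BIT` and `CLEAR_DSP_BIT`, the opcode mnemonics, and the function name `assign_cores` are hypothetical; the specification says only that the DSP routine sets the bit on entry and clears it before returning.

```python
def assign_cores(stream):
    """Walk an instruction stream and record which core runs each opcode.

    The two marker entries toggle the DSP bit and are not themselves
    assigned to a core in this simplified model.
    """
    dsp_bit = False
    executed = []
    for op in stream:
        if op == "SET_DSP_BIT":
            dsp_bit = True
        elif op == "CLEAR_DSP_BIT":
            dsp_bit = False
        else:
            executed.append(("dsp_core" if dsp_bit else "x86_core", op))
    return executed
```

For a stream of three general purpose instructions followed by a bracketed DSP routine, the first three entries and everything after the clearing marker land on the general purpose core, and only the bracketed instructions reach the DSP core.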
Conclusion 

Therefore, the present invention comprises a novel CPU 
or microprocessor architecture which optimizes execution of 
DSP and/or mathematical operations while maintaining 
backwards compatibility with existing software. 

Although the system and method of the present invention 
has been described in connection with the preferred 
embodiment, it is not intended to be limited to the specific 
form set forth herein, but on the contrary, it is intended to 
cover such alternatives, modifications, and equivalents, as 
can be reasonably included within the spirit and scope of the 
invention as defined by the appended claims. 

We claim: 

1. A central processing unit which performs general
purpose processing functions and digital signal processing 
(DSP) functions, comprising: 

an instruction memory for storing a plurality of 
instructions, wherein said instruction memory stores 
one or more sequences of instructions which are 
intended to perform a DSP function; 

a processor mode memory for storing one or more pro- 
cessor mode bits, wherein said one or more processor 
mode bits indicate whether a sequence of instructions 
implements a DSP function; 

a function preprocessor coupled to the instruction 
memory and coupled to the processor mode memory, 
wherein the function preprocessor is operable to exam- 
ine said one or more processor mode bits in said 
processor mode memory to determine whether a 
sequence of said instructions in said instruction 
memory is intended to perform a digital signal pro- 
cessing function, wherein the function preprocessor is 
operable to convert said sequence of said instructions in 
said instruction memory into a DSP function identifier 
if said one or more processor mode bits in said pro- 
cessor mode memory indicate that said sequence of 
said instructions in said instruction memory is intended 
to perform a DSP function; 

at least one general purpose processing core coupled to
the function preprocessor for executing instructions in 
said instruction memory, wherein the function prepro- 
cessor provides a sequence of instructions to said at 
least one general purpose processing core if said one or 
more processor mode bits indicate that said sequence of 
said instructions in said instruction memory is not 
intended to perform a DSP function; 

at least one digital signal processing core coupled to the
function preprocessor for performing digital signal 
processing functions, wherein the function preproces- 
sor is operable to provide said digital signal processing 
function identifier to said at least one digital signal 
processing core, wherein the at least one digital signal
processing core receives said digital signal processing 
function identifier and performs a digital signal pro- 
cessing function in response to said received digital 
signal processing function identifier from said function 
preprocessor. 

2. The central processing unit of claim 1, wherein said
instruction memory stores a first sequence of instructions 
which does not perform a digital signal processing function, 
and wherein said instruction memory stores a second 
sequence of instructions which performs a digital signal 
processing function;

wherein said at least one general purpose processing core 
executes said first sequence of instructions; 

wherein said at least one digital signal processing core 
performs said digital signal processing function in 
response to said received digital signal processing 
function identifier, wherein said digital signal process- 
ing function performed by said digital signal processing 
core is substantially equivalent to execution of said 
second sequence of instructions. 

3. The central processing unit of claim 1, wherein said
processor mode memory stores a respective value for said
one or more processor mode bits for each respective
sequence of instructions in said instruction memory;

wherein said respective value indicates whether said 
respective sequence of instructions implements a DSP 
function. 

4. The central processing unit of claim 1, wherein said 
processor mode memory stores a value indicating a type of 
DSP function implemented by a sequence of instructions; 

wherein said processor mode memory stores said value 
indicating said type of DSP function implemented by 
said sequence of instructions when said one or more 
processor mode bits indicate that said sequence of 
instructions implements a DSP function; 

wherein said function preprocessor uses said value indi- 
cating said type of DSP function implemented by said 
sequence of instructions in converting said sequence of 
said instructions in said instruction memory into a DSP 
function identifier. 

5. The central processing unit of claim 1, wherein said at
least one digital signal processing core provides data and 
timing signals to said at least one general purpose processing 
core. 

6. The central processing unit of claim 1, wherein said 
function preprocessor generates a digital signal processing 
function identifier and one or more parameters in response 
to said one or more processor mode bits indicating that said 
sequence of instructions in said instruction memory is 
intended to perform a digital signal processing function. 

7. The central processing unit of claim 1, wherein said at
least one general purpose processing core is compatible with
the X86 family of microprocessors. 

8. The central processing unit of claim 7, wherein said 
plurality of instructions are X86 opcodes.

9. The central processing unit of claim 1, wherein said at 
least one digital signal processing core is adapted for per- 
forming one or more mathematical operations from the 
group consisting of convolution, correlation, Fast Fourier
Transforms, and inner product. 

10. The central processing unit of claim 1, wherein said at
least one general purpose processing core and said at least 
one digital signal processing core operate substantially in 
parallel. 

11. A method for executing instructions in a central 
processing unit (CPU), wherein the CPU includes at least 
one general purpose CPU core and at least one digital signal
processing (DSP) core, the method comprising: 

storing one or more sequences of instructions in an 
instruction memory for execution by the central pro- 
cessing unit; 

storing one or more processor mode bits in a processor 
mode memory, wherein said one or more processor 
mode bits indicate whether a sequence of instructions 
implements a DSP function; 

examining a sequence of instructions in said instruction 
memory; 

examining said one or more processor mode bits to 
determine whether said sequence of instructions in said 
instruction memory is intended to perform a DSP 
function; 

converting said sequence of instructions in said instruc- 
tion memory into a DSP function identifier if said one 
or more processor mode bits indicate that said sequence 
of instructions in said instruction memory is intended to 
perform a DSP function; 

the digital signal processing core receiving said DSP 
function identifier; 

the digital signal processing core performing a digital 
signal processing function in response to said received 
digital signal processing function identifier. 

12. The method of claim 11, further comprising: 

said general purpose central processing unit core execut- 
ing said sequence of instructions if said one or more 
processor mode bits indicate that said sequence of 
instructions in said instruction memory is not intended 
to perform a DSP function. 

13. The method of claim 12, further comprising:

wherein said storing comprises storing a first sequence of
instructions in said instruction memory which performs
a first digital signal processing function; 

wherein said storing comprises storing a second sequence 
of instructions in said instruction memory which does 
not perform a digital signal processing function; 

wherein said converting converts said first sequence of 
instructions in said instruction memory which is 
intended to perform said first digital signal processing 
function into a first digital signal processing function 
identifier; 

wherein said performing comprises said digital signal 
processing core performing said first digital signal 
processing function in response to said first digital 
signal processing function identifier, wherein said per- 
forming said first digital signal processing function is 
substantially equivalent to execution of said first 
sequence of instructions; and 

said general purpose central processing unit core execut- 
ing said second sequence of instructions. 

14. The method of claim 11, wherein said storing one or 
more processor mode bits in the processor mode memory 
comprises storing a respective value for said one or more 
processor mode bits for each respective sequence of instruc- 
tions in said instruction memory; 

wherein said respective value indicates whether said 
respective sequence of instructions implements a DSP 
function. 

15. The method of claim 11, further comprising:

storing a value in said processor mode memory indicating
a type of DSP function implemented by a sequence of
instructions; 
wherein said processor mode memory stores said value
indicating said type of DSP function implemented by
said sequence of instructions when said one or more
processor mode bits indicate that said sequence of
instructions implements a DSP function;
wherein said function preprocessor uses said value indi-
cating said type of DSP function implemented by said
sequence of instructions in converting said sequence of
said instructions in said instruction memory into a DSP
function identifier.
16. The method of claim 11, further comprising:
said digital signal processing core and said general pur-
pose central processing unit core operating substan- 
tially in parallel. 
17. The method of claim 11, further comprising: 
said digital signal processing core providing data and 
timing signals to said general purpose central process- 
ing unit core. 
18. The method of claim 11, further comprising:
said function preprocessor generating a digital signal
processing function identifier and one or more param-
eters in response to said determining that said sequence
of instructions in said instruction memory is intended to
perform a digital signal processing function.
19. The method of claim 9, wherein said general purpose
central processing unit core is compatible with the X86
family of microprocessors;

wherein said one or more sequences of instructions com- 
prise X86 opcodes. 
20. The method of claim 11, wherein said digital signal 
processing core performs one or more mathematical opera- 
tions from the group consisting of convolution, correlation, 
Fast Fourier Transform, and inner product. 

***** 
In one embodiment, a processor includes thread manage-
ment logic including a thread predictor having state
machines to indicate whether thread creation opportunities
should be taken or not taken. The processor includes a
predictor training mechanism to receive retired instructions
and to identify potential threads from the retired instructions
and to determine whether a potential thread of interest meets
a test of thread goodness, and if the test is met, one of the
state machines that is associated with the potential thread of 
interest is updated in a take direction, and if the test is not 
met, the state machine is updated in a not take direction. The 
thread management logic may control creation of an actual 
thread and may further include reset logic to control whether 
the actual thread is reset and wherein if the actual thread is 
reset, one of the state machines associated with the actual 
thread is updated in a not take direction. The final retirement 
logic may control whether the actual thread is retired, and 
wherein if the actual thread is retired, the state machine 
associated with the actual thread is updated in a take 
direction. The circuitry may be used in connection with a 
multi-threading processor that detects speculation errors 
involving thread dependencies in execution of the actual 
threads and re-executes instructions associated with the 
speculation errors from trace buffers outside an execution 
pipeline. 

22 Claims, 10 Drawing Sheets 
MULTITHREADING PROCESSOR WITH THREAD PREDICTOR

RELATED APPLICATION

The present application is a continuation-in-part of U.S. 
application Ser. No. 08/992,375, filed Dec. 16, 1997, now 
pending. 

BACKGROUND OF THE INVENTION 

1. Technical Field of the Invention

The present invention relates to processors and, more
particularly, to creation and management of threads in a
processor.

2. Background Art

Current superscalar processors, such as a microprocessor,
perform techniques such as branch prediction and out-of- 
order execution to enhance performance. Processors having 
out-of-order execution pipelines execute certain instructions 
in a different order than the order in which the instructions
were fetched and decoded. Instructions may be executed out 
of order with respect to instructions for which there are not 
dependencies. Out-of-order execution increases processor 
performance by preventing execution units from being idle 
merely because of program instruction order. Instruction 
results are reordered after execution. 

The task of handling data dependencies is simplified by 
restricting instruction decode to being in-order. The proces- 
sors may then identify how data flows from one instruction 
to subsequent instructions through registers. To ensure pro- 
gram correctness, registers are renamed and instructions 
wait in reservation stations until their input operands are
generated, at which time they are issued to the appropriate
functional units for execution. The register renamer, reser- 
vation stations, and related mechanisms link instructions 
having dependencies together so that a dependent instruction 
is not executed before the instruction on which it depends. 
Accordingly, such processors are limited by in-order fetch 
and decode. 

When the instruction from the instruction cache misses or 
a branch is mispredicted, the processors have either to wait 
until the instruction block is fetched from the higher level 
cache or memory, or until the mispredicted branch is 
resolved, and the execution of the false path is reset. The 
result of such behavior is that independent instructions 
before and after instruction cache misses and mispredicted 
branches cannot be executed in parallel, although it may be 
correct to do so. 

Multithreading processors such as shared resource mul- 
tithreading processors and on-chip multiprocessor (MP) 
processors have the capability to process and execute mul- 
tiple threads concurrently. The threads that these processors 
process and execute are independent of each other. For 
example, the threads are either from completely independent 
programs or are from the same program but are specially 
compiled to create threads without dependencies between 
threads. However, these processors do not have the ability to 
concurrently execute different threads from the same pro-
gram that may have dependencies. The usefulness of the
multithreading processors is thereby limited. 

Accordingly, there is a need for multithreading processors 
that have the ability to concurrently execute different threads 
from the same program where there may be dependencies 
among the threads. 

SUMMARY OF THE INVENTION 

In one embodiment, a processor includes thread manage-
ment logic including a thread predictor having state
machines to indicate whether thread creation opportunities
should be taken or not taken. The processor includes a
predictor training mechanism to receive retired instructions
and to identify potential threads from the retired instructions
and to determine whether a potential thread of interest meets
a test of thread goodness, and if the test is met, one of the
state machines that is associated with the potential thread of
interest is updated in a take direction, and if the test is not
met, the state machine is updated in a not take direction.

The thread management logic may control creation of an
actual thread and may further include reset logic to control
whether the actual thread is reset and wherein if the actual
thread is reset, one of the state machines associated with the
actual thread is updated in a not take direction. The final
retirement logic may control whether the actual thread is
retired, and wherein if the actual thread is retired, the state
machine associated with the actual thread is updated in a
take direction.
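
The four training events named above can be sketched in Python with a saturating counter standing in for one state machine. The counter width, the initial state, the rule that the upper half of the states predicts "take", and all names below are assumptions; the summary does not specify how many states a state machine has.

```python
class ThreadPredictorEntry:
    """One predictor state machine, modeled as a saturating counter.

    With the default max_state of 3, states 0-1 predict 'not take'
    and states 2-3 predict 'take' (an assumed encoding).
    """

    def __init__(self, state=1, max_state=3):
        self.state = state
        self.max_state = max_state

    def update_take(self):
        self.state = min(self.state + 1, self.max_state)

    def update_not_take(self):
        self.state = max(self.state - 1, 0)

    def predict_take(self):
        return self.state > self.max_state // 2


def train(entry, event):
    """Apply one of the training events described in the summary."""
    if event in ("goodness_test_met", "thread_retired"):
        entry.update_take()
    elif event in ("goodness_test_not_met", "thread_reset"):
        entry.update_not_take()
    else:
        raise ValueError(f"unknown event: {event}")
```

A thread that is retired moves its entry toward "take", while a reset or a failed goodness test moves it back, so the entry reflects the recent history of that thread creation opportunity.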

The circuitry may be used in connection with a multi-
threading processor that detects speculation errors involving
thread dependencies in execution of the actual threads and 
re-executes instructions associated with the speculation 
errors from trace buffers outside an execution pipeline. 

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be understood more fully from the 
detailed description given below and from the accompany- 
ing drawings of embodiments of the invention which, 
however, should not be taken to limit the invention to the
specific embodiments described, but are for explanation and
understanding only. 

FIG. 1 is a block diagram of a processor according to one
embodiment of the invention.

FIG. 2 is a flow diagram of an example of two threads.

FIG. 3 is a flow diagram of another example of two
threads.

FIG. 4 is a flow diagram of an example of six threads.

FIG. 5 is a graph showing overlapping execution of the
threads of FIG. 6.

FIG. 6 is a block diagram illustrating individual trace
buffers according to one embodiment of the invention.

FIG. 7 is a block diagram of details of certain compo-
nents of the thread management logic and final retirement
logic of FIG. 1 according to one embodiment of the inven-
tion.

FIGS. 8A, 8B, 8C, and 8D illustrate trees to organize
thread relationships of FIG. 4.

FIG. 9 illustrates states of a state machine.

FIG. 10 is a correlated thread predictor according to one
embodiment of the invention.

FIG. 11 is a simple thread predictor according to one
embodiment of the invention.

FIG. 12 illustrates nested procedures.

FIG. 13 is a graph illustrating execution of the actual
threads corresponding to potential threads in FIG. 12.

FIGS. 14A, 14B, 14C, 14D, 14E, and 14F are graphical
representations of predictor stacks according to one embodi-
ment of the invention.

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS 
A. System Overview
B. Thread Predictor and Training Mechanism
1. When an Actual Thread is Retired or Reset Without
Retirement
2. When a Potential Thread does or does not Meet a Test
of Thread Goodness
C. Additional Information and Embodiments

Incorporation by Reference: Page 3, line 12 to page 37,
line 30 and FIGS. 1-38 of U.S. application Ser. No.
08/992,375, filed Dec. 16, 1997, now pending, are incorporated
into the present specification and drawings by reference.
However, the present invention is not limited to use in
connection with the various examples provided in U.S.
application Ser. No. 08/992,375.

Reference in the specification to "one embodiment" or
"an embodiment" means that a particular feature, structure,
or characteristic described in connection with the
embodiment is included in at least one embodiment of the invention.
The appearances of the phrase "in one embodiment" in 
various places in the specification are not necessarily all 
referring to the same embodiment. 

Referring to FIG. 1, a processor 50 includes an execution
pipeline 108 in which instructions may be executed
speculatively. Examples of the speculation include data
speculation and dependency speculation. Any of a wide variety of
speculations may be involved. Processor 50 includes
mechanisms, including in trace buffer 114, to detect
speculation errors (misspeculations) and to recover from them.
When a misspeculation is detected, the misspeculated
instruction is provided to execution pipeline 108 from trace
buffers 114 through conductors 120 and is replayed in
execution pipeline 108. If an instruction is "replayed," the
instruction and all instructions dependent on the instruction
are re-executed, although not necessarily simultaneously. If
an instruction is "replayed in full," the instruction and all
instructions following the instruction in program order are
re-executed. The program order is the order the instructions
would be executed in an in order processor. When it is
determined that an instruction should be replayed,
instructions which are directly or indirectly dependent on that
instruction are also replayed. The number of re-executions
of instructions can be controlled by controlling the events
which trigger replays. In general, the term execute may
include original execution and re-execution. Results of at
least part of the instructions are provided to the trace buffers.
Final retirement logic 134 finally retires instructions in trace
buffers 114 after it is assured that the instructions were
correctly executed either originally or in re-execution. The
invention is not restricted to any particular execution
pipeline. For example, although processor 50 is an out-of-order
processor for intra-thread execution, the invention could be
used with an in order processor for intra-thread execution.
Execution pipeline 108 may be any of a wide variety of
execution pipelines and may be a section of a larger pipeline.
Execution pipeline 108 may be used in connection with a
wide variety of processors.

As examples, the following describes certain processors 
in which the prediction and training circuitry of the present 
invention may be included. However, the invention is not
restricted to use in such processors. 
A. System Overview 

1. Creation of Threads and Overview of Pipeline 

Instructions are provided through conductors 102 to an
instruction cache (I-cache) 104. A decoder 106 is illustrated
as receiving instructions from I-cache 104, but alternatively
could decode instructions before they reach I-cache 104.
Depending on the context and implementation chosen, the
term "instructions" may include macro-operations
(macro-op), micro-operations (uops), or some other form of
instructions. Any of a variety of instruction sets may be used
including, but not limited to, reduced instruction set
computing (RISC) or complex instruction set computing (CISC)
instructions. Further, decoder 106 may decode CISC
instructions to RISC instructions. Instructions from I-cache 104 are
provided to pipeline 108 through MUX 110 and to trace
buffers 114 through conductors 118.

A trace is a set of instructions. A thread includes the trace
and related signals such as register values and program 
counter values. 

Thread management logic 124 creates different threads 
from a program or process in I-cache 104 by providing 
starting counts to program counters 112A, 112B, . . . , 112X,
through conductors 130 (where X represents the number of
program counters). As an example, X may be 4 or more or
less. Thread management logic 124 also ends threads by
stopping the associated program counter. Thread management
logic 124 may cause the program counter to then begin
another thread. Portions of different threads may be concur- 
rently read from I-cache 104. 

To determine where in a program or process to create a
thread, thread management logic 124 may read instructions
from decoder 106 through conductors 128. The instructions may
include instructions inserted by a programmer or compiler
that expressly demarcate the beginning and ending of
threads. Alternatively, thread management logic 124 may
analyze instructions of the program or process to break up a
program or process supplied to I-cache 104 into different
threads. For example, branches, loops, backward branches,
returns, jumps, procedure calls, and function calls may be
good points to separate threads. Thread management logic
124 may consider the length of a potential thread, how many
variables are involved, the number of variables that are
common between successive threads, and other factors in
considering where to start a thread. Thread management
logic 124 may consider the program order in determining the
boundaries of threads. The program order is the order the
threads and the instructions within the threads would be
executed on an in order processor. The instructions within
the threads may be executed out of order (contrary to
program order). The threads may be treated essentially
independently by pipeline 108. Thread management logic
124 may include a prediction mechanism including a history
table to avoid making less than optimal choices. For
example, thread management logic 124 may create a thread
and then later determine that the thread was not actually part
of the program order. In that case, if the same code is
encountered again, the prediction mechanism could be used
to determine whether to create that same thread again.

Dynamically creating threads is creating threads from a 
program that was not especially written or compiled for 
multithreading, wherein at least one of the threads is depen- 
dent on another of the threads. The program may originate 
from off a chip that includes execution pipeline 108 and
thread management logic 124. Dynamically creating the
threads, executing the threads, and detecting and correcting 
speculation errors in the execution is referred to as dynamic 
multithreading. 

Thread management logic 124 may use a combination of 
dynamic creation of threads and use of express instruction 
hints from a compiler or programmer. Thread management 
logic 124 would ordinarily create threads dynamically. 
However, the express instruction hints could be used occa- 
sionally. One way in which they may be used is in connec- 
tion with if-then-else code in which a thread could be created 
beginning with the code following the conclusion of the else 
statement. It would take considerable complexity for thread
management logic 124 to look forward to these instructions,
but it can be done in a compiler.

An advantage of using both a compiler and hardware for
multithreading is that the compiler does not have to be
conservative. If only the compiler does multithreading, the
compiler must be conservative to handle worst cases. If the
hardware assists the compiler, the compiler can be more
aggressive because the hardware will detect and replay
instructions associated with an error such as misspeculation.

FIG. 2 illustrates a thread T1 that includes a conditional
backward branch instruction. In program order, thread T2 is
executed following the conditional branch instruction when
the branch is not taken. In time order, thread T2 is executed
speculatively beginning at the time thread T1 first reaches
the conditional branch instruction. Therefore, portions of
thread T1 and T2 are executed concurrently. If thread T2
involves misspeculations, the affected instructions of thread
T2 are replayed. Thread management logic 124 may monitor
the count of the program counters through conductors 130.
A purpose of monitoring the count is to determine when a
thread should end. For example, when the condition of the
conditional branch is not met, if the program counter of
thread T1 were allowed to continue, it would advance to the
first instruction of thread T2. Therefore, thread management
logic 124 stops the program counter of thread T1 when the
condition is not met.

FIG. 3 illustrates a thread T1 that includes a function call
instruction. In program order, when the call instruction is
reached, the program counter jumps to the location of the
function and executes until a return instruction, at which
time the program counter returns to the instruction after the
call. In program order, thread T2 begins at the instruction
following the return. In time order, thread T2 is executed
speculatively beginning at the time thread T1 first reaches
the call. If thread T2 involves misspeculations, the affected
instructions of thread T2 are replayed. Thread T1 ends when
its program counter reaches the first instruction of thread T2.

FIG. 4 illustrates threads T1, T2, T3, and T4 which are
part of a section of a program. Different program counters
produce threads T1, T2, T3, and T4. Thread T1 includes
instructions to point A (function call instruction) and then
from point B, to point C (conditional backward branch
instruction), to point D and to point C again (the loop may
be repeated several times). Thread T2 begins at the
instruction that in program order is immediately after the return
instruction of the function that is called at point A. Thread
T3 begins at the fall through instruction in memory
following the conditional backward branch of point C and
continues to point E, to point F, to point G, to point H, and to point
I, which is a return instruction to the instruction immediately
following point A where thread T2 begins. Thread T4 begins
at the fall through instruction in memory following the
conditional backward branch at point E. Thread T5 starts
following a backward branch instruction at point J and
thread T6 starts following a function call at point K. It is
assumed that there are only four trace buffers.

As illustrated in FIG. 5, portions of threads T1, T2, T3,
and T4 are fetched, decoded, and executed concurrently. The
threads are fetched, decoded, and executed out of order
because the program order is not followed. In time order,
execution of threads T2, T3, and T4 begins immediately
following instructions at points A, C, and E, respectively.
The vertical dashed lines show a parent-child relationship.
Threads T2, T3, and T4 are executed speculatively by
relying on data in registers and/or memory locations before
it is certain that the data is correct. Processor 50 has
mechanisms to detect misspeculation and cause
misspeculated instructions to be replayed. It turns out that thread T4
is not part of the program order. Thread T4 may be executed
until thread management logic 124 determines that thread
T4 is not part of the program order. At that time, thread T4
may be reset without retirement under the control of reset
logic 68 in thread management logic 124 and the resources
that held or processed thread T4 in processor 50 may be
deallocated and then allocated for another thread. In
program order, threads T1, T2, and T3 would be executed as
follows: first thread T1, then thread T3, and then thread T2.

Referring to FIG. 1, instructions from MUX 110 are 
received by rename/allocate unit 150 which provides a 
physical register identification (PRID) of the renamed
physical register in register file 152. The PRID is provided to trace
buffer 114 through bypass conductors 126. Allocation
involves assigning registers to the instructions and assigning
entries of the reservation stations of schedule/issue unit 156.
Once the operands are ready for a particular instruction in 
the reservation stations, the instruction is issued to one of the 
execution units (e.g., integer, floating point) of execution 
units 158 or a memory execution pipeline which includes 
address generation unit (AGU) 172, memory order buffer 
(MOB) 178 (including load buffers 182 (one for each active 
thread) and store buffers 184 (one for each active thread)) 
and data cache 176. Depending on the instructions, operands 
may be provided from register file 152 through conductors 
168. Under one embodiment of the invention, dependent 
instructions within a thread may be so linked that they are 
not executed out-of-order. However, dependent instructions 
from different threads may be concurrently fetched, 
decoded, and executed out-of-order. 

For high performance, reservation stations and related 
mechanisms are designed to have both low latency and high 
bandwidth issue of instructions. The latency and bandwidth 
requirements place restrictions on the number of instructions 
that can be waiting in the reservation stations. By
positioning trace buffers 114 outside pipeline 108, a large number of
instructions can be available for execution/replay without 
significantly decreasing throughput of pipeline 108. The 
effect of latency between execution pipeline 108 and trace 
buffers 114 can be reduced through pipelining. 

The result of an execution and related information are 
written back from the memory through MUX 192 and 
conductors 196 (in the case of loads) and from writeback 
unit 162 through conductors 122 (in the case of other
instructions) to trace buffers 114. The results and related
information may also be written to register file 152 and
associated re-order buffer (ROB) 164. Once the result and
information of an instruction are written to register file 152
and ROB 164, the instruction is retired in order as far as
pipeline 108 is concerned. This retirement is called a first
level or initial retirement. At or before the first level
retirement, resources for the retired instruction in schedule/
issue unit 156 including the reservation stations, register file
152, and ROB 164 are deallocated. However, all needed
details regarding the instruction are maintained in trace 
buffers 114 and MOB 178 until a final retirement, described 
below. 

A dependency exists between a later thread and an earlier 
thread when in program order, data used in the later thread 
is produced in the earlier thread. The data may have been 
produced in the earlier thread through a memory or non- 
memory instruction. For example, the later thread may be 
dependent on the earlier thread if a load instruction in the 
later thread has the same address as a store instruction in the 
earlier thread. The later thread may also be dependent on the 
earlier thread if an instruction in the later thread involves a 
register that was modified in the earlier thread. Likewise, a 
later instruction is dependent on an earlier instruction when
in program order the later instruction uses data produced by
the earlier instruction. The word "dependency" is also used
in the phrase "dependency speculation." An example of a 
dependency speculation is speculating that there is no 
dependency between a load instruction and an earlier store 
instruction. Address matching is an example of a technique 
for checking for dependency speculation errors. An example 
of data speculation is speculating that the data in a register 
is the correct data. Register matching is an example of a
technique for checking for data speculation errors. 

Referring to FIG. 6, trace buffers 114 include trace buffers
114A, 114B, 114C, . . . , 114Y, where Y represents the number
of trace buffers. For example, if Y=4 (i.e., Y=D), there are 4
trace buffers. If Y is less than 3, trace buffers 114 do not
include all the trace buffers shown in FIG. 6. Y may be the
same as or different than X (the number of program 
counters). Trace buffers 114 may be a single memory 
divided into individual trace buffers, or physically separate 
trace buffers, or some combination of the two. 

Referring to FIG. 6, trace buffers 114A, 114B, . . . , 114Y
receive instructions through conductors 118A, 118B, . . . ,
118Y, which are connected to conductors 118. There may be
demultiplexing circuitry between conductors 118A,
118B, . . . , 118Y and conductors 118. Alternatively, enable
signals may control which trace buffer is activated. Still
alternatively, there may be enough parallel conductors to
handle parallel transactions. Trace buffers 114A, 114B, . . . ,
114Y supply instructions and related information for replay
to pipeline 108 through conductors 120A, 120B, . . . , 120Y,
which are connected to conductors 120. It is noted that
multiple instructions from trace buffers 114 may
concurrently pass through conductors 120 and MUX 110 for
re-execution. At the same time, multiple instructions from
decoder 106 may also pass through MUX 110 for the first
time. A thread ID and instruction ID (instr ID) accompany
each instruction through the pipeline. A replay count may
also accompany the instruction. In the case of load and store
instructions, a load buffer ID (LBID) and a store buffer ID
(SBID) may also accompany the instruction. In one
embodiment, the LBID and SBID accompany every
instruction, although the LBID and SBID values may be
meaningless in the case of instructions which are not loads
or stores. As described below, a PRID or value may also
accompany an instruction being re-executed.

Trace buffers 114A, 114B, . . . , 114Y receive PRID,
LBID, and SBID values from rename/allocate unit 150
through bypass conductors 126A, 126B, . . . , 126Y, which are
connected to conductors 126. Trace buffers 114A, 114B, . . . ,
114Y receive writeback result information and related
signals through conductors 122A, 122B, . . . , 122Y, which are
connected to conductors 122, and through conductors 196A,
196B, . . . , 196Y, which are connected to conductors 196.
Replay signals are provided through conductors 194A,
194B, . . . , 194Y, which are connected to conductors 194.
Multiplexing and/or enable circuitry and/or a substantial
number of parallel conductors may be used in conductors
120, 126, 122, 194, and 196. The trace buffers are not
necessarily identical.

2. Misspeculation Detection Circuitry 

The following provides examples of misspeculation
detection circuitry, but the invention is not limited to use
with such detection circuitry. Trace buffers 114 include
detection circuitry to detect certain speculation errors.
According to one embodiment of the invention, each trace
buffer has an output register file that holds the register
context of the associated thread and an input register file to
receive the register context of the parent thread output
register. The register context is the contents or state of the
logical registers. The contents of the output register file are
updated often, perhaps each time there is a change in a
register. The contents of the input register file are initialized
when a thread is created and updated only after a
comparison, described below.

An output register file and input register file may include
a Value or PRID field and a status field. The status field
indicates whether a valid value or a valid PRID is held in the
Value or PRID field. A comparator compares the contents of
input register file (in trace buffer 114B) for a current thread
with the contents of output registers for an immediately
preceding thread in program order. The comparison can be
made at the end of the execution of the immediately
preceding thread or during the execution of the preceding
thread. The comparison is also made at the end of the
retirement of the preceding thread. In one embodiment, the
comparison is only made at the end of the retirement of the
preceding thread. In another embodiment, the comparison is
also made at intermediate times. In other embodiments, the
comparison is made continuously during final retirement of
the preceding thread.

Various events could trigger a comparison by the
comparator. The comparison is made to detect speculation errors.
If there is a difference between the speculative input register
file and the output register file from the preceding thread,
values of one or more output registers of the immediately
preceding thread have changed and a replay is triggered.
Trace buffer 114B causes the affected instructions to be
replayed with the changed register values. There is no
guarantee the changed values are the ultimately correct
values (i.e., the values that would have been produced in an
in order processor). The instructions may need to be
replayed again, perhaps several times.
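
The comparison described above can be sketched in software. The
following Python fragment is only an illustration under stated
assumptions: register files are modeled as dictionaries from logical
register names to values, and the function names are hypothetical and
do not appear in the specification. It reports which registers differ
and then updates the input register file, mirroring the statement that
the input register file is updated only after a comparison.

```python
def compare_register_files(input_regs, parent_output_regs):
    """Return the logical registers whose speculative input value differs
    from the value now held in the preceding thread's output register file."""
    return sorted(
        reg for reg, value in parent_output_regs.items()
        if input_regs.get(reg) != value
    )


def check_and_refresh(input_regs, parent_output_regs):
    """Compare, then update the input register file with the changed values.

    A non-empty result corresponds to a replay being triggered for the
    instructions that depend on the returned registers. (Hypothetical model;
    the hardware uses a comparator, not dictionaries.)
    """
    changed = compare_register_files(input_regs, parent_output_regs)
    for reg in changed:
        input_regs[reg] = parent_output_regs[reg]
    return changed
```

Because the refreshed values are themselves not guaranteed to be final,
a later call may report further changes, which corresponds to the
instructions being replayed more than once.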

When replay triggering logic determines that a source 
operand (or other input value) has been mispredicted, it 
triggers the corresponding trace buffer (such as trace buffer 
114B) to dispatch those instructions that are directly or 
indirectly dependent on the mispredicted source operand to 
be replayed in pipeline 108.

In one embodiment, an algorithm identifies dependent
instructions in a way that is similar to register
renaming. Instead of a mapping table, a dependency table is used.
The table contains a flag for each logical register that
indicates if the register depends on the mispredicted value.
At the start of a recovery sequence, the flag corresponding
to each mispredicted register is set. The dependency table is
checked as instructions are read from the trace buffer. If one
of the source operand flags is set, an instruction depends on
the mispredicted value. The instruction is selected for
recovery dispatch and its destination register flag is set in the
table. Otherwise, the destination register flag is cleared.
Only the table output for the most significant instruction
within the block can be relied upon all the time. Subsequent
instructions may depend on instructions ahead of them
within the block. A bypass at the table output is needed to
handle internal block dependencies. If the dependency table
reaches a state during the recovery sequence in which all the
dependency flags are clear, the recovery sequence is
terminated.
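
As a rough software analogy of the dependency table just described, the
sketch below walks a trace in program order and selects the
instructions to dispatch for recovery. It is a simplification under
stated assumptions: instructions are processed one at a time, so the
block-level bypass is not modeled, and the function name and the
(destination, sources) tuple encoding are hypothetical.

```python
def select_for_recovery(trace, mispredicted_regs):
    """Return indices of trace entries selected for recovery dispatch.

    trace: list of (dest_reg, [source_regs]) tuples in program order;
           dest_reg may be None for instructions that write no register.
    mispredicted_regs: logical registers whose flags are set at the start
           of the recovery sequence.
    """
    flags = set(mispredicted_regs)  # registers whose dependency flag is set
    selected = []
    for index, (dest, sources) in enumerate(trace):
        if not flags:
            # All dependency flags are clear: terminate the sequence.
            break
        if any(src in flags for src in sources):
            # A source flag is set: the instruction depends on the
            # mispredicted value, so select it and set its destination flag.
            selected.append(index)
            if dest is not None:
                flags.add(dest)
        elif dest is not None:
            # Otherwise the destination register flag is cleared.
            flags.discard(dest)
    return selected
```

In this model an instruction that overwrites a flagged register from
unflagged sources clears the flag, so later readers of that register
are not replayed.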

The identified instructions are dispatched from the trace
buffer for execution in the order the instructions exist in the
trace buffer (which is the program order). However, the
instructions may be executed out of order under the control
of schedule/issue unit 156, as in any out-of-order processor.
Control bits are appended to the instruction dispatched from
the trace buffer to indicate to rename/allocate unit 150
whether to (1) do register renaming, (2) bypass the rename
alias table lookup in rename/allocate unit 150 and instead
use the PRID from the corresponding trace buffer, or (3)
bypass renaming completely and use the value from the
trace buffer as if it were a constant operand in the
instruction.

3. Information Regarding Thread Management Logic and 
Final Retirement

The following are some details regarding thread
management logic and final retirement that may be used in
connection with some embodiments of the invention. However, the
invention is not limited to such details. In one embodiment,
thread management logic 124 uses a tree structure (such as
tree structure 58 in FIG. 7) to keep track of thread order.
Under the tree structure, the program order (which is also the
retirement order) flows from top to bottom, and a node on
the right is earlier in program order than a node on the left.
A root is the first in program order. A tree is an abstract idea,
whereas a tree structure is circuitry that implements the tree.

For example, FIGS. 8A, 8B, 8C, and 8D illustrate tree
structures for threads of FIG. 4 at different times. FIG. 8A
illustrates the tree structure at time t1. Thread T2 is added to
the tree before thread T3 is added to the tree. Thread T4 is
added to the tree after thread T3 is added to the tree. Threads
T2 and T3 are children of thread T1. Thread T4 is a child of
thread T3. Following the rules of top to bottom and right to
left, the program and retirement orders are thread T1, T3, T4,
and T2. FIG. 8B illustrates the tree structure at time t2
assuming that thread T4 is reset before thread T1 retires. The
program and retirement orders are thread T1, T3, T2, and T5.
FIG. 8C illustrates the tree structure at time t2 assuming that
thread T1 retires before thread T4 is reset. The program and
retirement orders are thread T3, T4, T2, and T5. FIG. 8D
illustrates the tree structure at time t3, which is after the time
thread T1 retires and thread T4 is reset. The program and
retirement orders are T3, T2, T5, and T6.

Threads begin at the instruction following a backward
branch or a function call. That is, threads begin at the next
instruction assuming the backward branch were not taken or
the function was not called (as illustrated by threads T2 in
FIGS. 2 and 3). In so doing, from the perspective of a thread
(node), the program order of children nodes of the thread is
the reverse of the order in which the threads were started
(created).
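
One way to read the ordering rules above is as a depth-first walk that
visits a node first and then its children from most recently created to
earliest created. The sketch below is an illustrative model only; the
class and function names are hypothetical, and the actual tree
structure is circuitry rather than software.

```python
class ThreadNode:
    """A node of the thread tree; children are kept in creation order."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def spawn(self, name):
        """Create a child thread and record it after earlier children."""
        child = ThreadNode(name)
        self.children.append(child)
        return child


def program_order(node):
    """Return thread names in program (and retirement) order.

    The parent comes first (top to bottom); its children follow in the
    reverse of their creation order (right to left).
    """
    order = [node.name]
    for child in reversed(node.children):
        order.extend(program_order(child))
    return order
```

Building the tree for time t1 (T2 spawned from T1, then T3 spawned from
T1, then T4 spawned from T3) and calling program_order on the root
yields T1, T3, T4, T2, which agrees with the order stated above for
that tree.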

In one embodiment, three events may cause a thread to be
removed from the tree: (1) A thread at the root of the tree is
removed when the thread is retired. When the thread at the
root is retired, the thread (node) that is next in program order
becomes the root and nodes are reassigned accordingly. (2)
A thread that is last in program order is removed from the
tree to make room for a thread higher in program order to be
added to the tree. In this respect, the tree acts as a
last-in-first-out (LIFO) stack. (3) A thread may be reset and thereby
removed from the tree when it is discovered that the program
counter of its parent thread is outside a range between a start
count and an end count.

An instruction is finally retired from trace buffer 114 when
all the instructions for all previous threads have retired and
all replay events that belong to the instruction have been
serviced. Stated another way, an instruction is finally retired
when it can be assured that the instruction has been executed
with the correct source operand. Threads are retired in order.
For example, an instruction in thread X cannot be retired
until all the previous threads have been retired (i.e., the
instructions of all the previous threads have been retired).
The instructions within a thread are retired in order, although
instructions that are all ready for retirement may be retired
simultaneously.

Final retirement is controlled by final retirement logic
134. In one embodiment of the invention, final retirement
includes (1) commitment of results to an in order register file;
(2) servicing of interrupts, exceptions, and/or branch
mispredictions; (3) deallocation of trace buffer and MOB 178 resource
entries; and (4) signaling the MOB to mark stores as retired
and to issue them to memory. Deallocating entries may
involve moving a head pointer. Store instructions in MOB
178 are not deallocated until after it is certain that associated
data is copied to data cache 176 or other memory.
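
The in-order final retirement rule stated above can be illustrated with
a short sketch. This is a simplified model under stated assumptions,
not the circuitry of final retirement logic 134: each thread is given
as a name plus a list of flags saying whether an instruction still has
unserviced replay events, and the function name is hypothetical.

```python
def finally_retirable(threads):
    """Return (thread_name, instruction_index) pairs that may finally retire.

    threads: list of (name, pending_replay_flags) in program order, where
             pending_replay_flags[i] is True if instruction i still has
             replay events that have not been serviced.
    Retirement stops at the first instruction that is not ready, because
    instructions retire in order within a thread and a later thread cannot
    retire until all previous threads have retired.
    """
    retired = []
    for name, pending_flags in threads:
        for index, has_pending_replay in enumerate(pending_flags):
            if has_pending_replay:
                return retired
            retired.append((name, index))
    return retired
```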

4. Memory Order Buffer

The invention is not limited to any particular type of
memory order buffer (MOB). However, to the extent load
and store instructions are executed speculatively, there
should be a mechanism to detect misspeculations and to
allow for replay of affected instructions. For example, there
may be a mechanism to detect when a load is made from a
memory location before a store to the same location (which
occurred earlier in program order but later in time order).
For example, MOB 178 includes a MOB for each thread,
where each MOB includes a load buffer and store buffer to
hold copies of load and store instructions of the traces in
trace buffers 114A, 114B, . . . , 114Y, respectively. Replay
conductors 194 provide signals from MOB 178 to trace
buffers 114 alerting trace buffers 114 that a load instruction
should be replayed. To ensure ultimate correctness of
execution, MOB 178 includes mechanisms to ensure
memory data coherency between threads.
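
The load/store check described above (a load executed before a store to
the same location that is earlier in program order) amounts to address
matching. The following sketch is illustrative only; the MemOp record,
its field names, and the function name are assumptions made for the
example and do not describe the actual load and store buffers.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    kind: str        # "load" or "store"
    addr: int        # memory address accessed
    prog_order: int  # position in program order (smaller is earlier)
    exec_time: int   # time at which the operation executed


def loads_to_replay(ops):
    """Return loads that executed before an earlier-in-program-order
    store to the same address, i.e. loads that read stale data."""
    stores = [op for op in ops if op.kind == "store"]
    stale = []
    for load in (op for op in ops if op.kind == "load"):
        for store in stores:
            if (store.addr == load.addr
                    and store.prog_order < load.prog_order
                    and store.exec_time > load.exec_time):
                stale.append(load)
                break
    return stale
```

A load flagged this way corresponds to the case in which a replay
signal would be sent to the trace buffer holding that load.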
B. Thread Predictor and Training Mechanism 

Referring to FIGS. 1 and 7, processor 50 includes thread
predictor 54 to make predictions regarding whether a thread 
should be created in response to reception of a thread 
spawning opportunity instruction. Examples of spawning 
opportunity instructions include function calls, backward 
branches for loops, and compiler inserted instructions sug- 
gesting or mandating spawning of a thread. (Spawning a 
thread means creating a thread.) Thread predictor 54 is used 
by thread management logic to increase the likelihood that 
executed threads will retire. The invention also includes a 
training mechanism 56 to train thread predictor 54 to help 
decrease the probability that thread management logic 124 
will create threads that have relatively few instructions or 
relatively little overlap with the spawning thread (the thread 
with the spawning opportunity instruction). These
predictions help increase processor performance.

The following terminology is used with respect to one 
embodiment of the invention. An actual thread is a thread 
that is created through a program counter and executed at 
least in part. An actual thread is either retired or reset without 
retirement. A thread creation opportunity is an opportunity 
to create an actual thread. Thread management logic 124 
decides whether or not to take the thread creation
opportunity and create an actual thread in response to detecting a
spawning opportunity instruction from decoder 106 (or from
other circuitry). A potential thread is a group of retired
instructions analyzed in a predictor training mechanism. The
potential thread is associated with a retired spawning
opportunity instruction. The set of instructions included in a
potential thread may or may not correspond exactly to a set
of instructions in a recently retired actual thread. Although
the specification does not always explicitly identify a thread
as being actual or potential, it is clear from the context. 

Thread predictor 54 includes numerous state machines
corresponding to different spawning opportunity
instructions. (In some embodiments, more than one state machine
may correspond to the same spawning opportunity
instruction, such as with the correlated predictor described
below.) The state machines may share some circuitry in
common. FIG. 9 illustrates states 0, 1, 2, . . . , 7 of a state
machine. As an example, states 0-3 may be not take states
and states 4-7 may be take states (although the number of
take and not take states does not have to be equal). A take state
means a thread creation opportunity is to be taken and a not
take state means the thread creation opportunity is not to be
taken. When the state machine is updated in the take
direction, the state changes from a lower to a higher state,
unless the state is already at state 7, in which case it remains
at state 7. When the state machine is updated in the not take
direction, the state changes from a higher to a lower state,
unless the state is already at state 0, in which case it remains
at state 0. Examples of updating in the take direction include
state 0 to state 1, state 3 to state 4, and state 7 to state 7.
Examples of updating in the not take direction include state
6 to state 5, state 4 to state 3, and state 0 to state 0. As an
example, if the state machine included three bit saturated
counters, states 0-7 might be binary states "000," "001,"
"010," "011," "100," "101," "110," and "111," respectively.

A state machine is updated in response to two events: (1)
an actual thread is retired or reset without retirement or (2)
a potential thread does or does not meet a test of thread
goodness. If the state machine is a counter, updating may
include increasing or decreasing a count. However, updating
a state machine does not always result in a change of the
state of the state machine (e.g., the state in a saturating
counter does not go above or below certain counts).
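
As a non-limiting illustration, the state machine behavior described above may be sketched in software. The sketch assumes a three-bit saturating counter with states 0-3 as not take states and states 4-7 as take states. The class name, method names, and the optional `step` argument (which allows an update of more than one state) are hypothetical and are not taken from the specification.

```python
class ThreadStateMachine:
    """Saturating counter sketch: states 0-3 are "not take", 4-7 are "take"."""

    MIN_STATE = 0
    MAX_STATE = 7
    TAKE_THRESHOLD = 4  # assumed split; take and not take counts need not be equal

    def __init__(self, state=3):
        self.state = state

    def update(self, take, step=1):
        """Move toward state 7 (take) or state 0 (not take), saturating at the ends."""
        if take:
            self.state = min(self.MAX_STATE, self.state + step)
        else:
            self.state = max(self.MIN_STATE, self.state - step)
        return self.state

    def predicts_take(self):
        """True if the current state is one of the take states."""
        return self.state >= self.TAKE_THRESHOLD

    def bits(self):
        """Three-bit binary encoding of the state, e.g. state 5 -> '101'."""
        return format(self.state, "03b")
```

With these assumptions, an update in the take direction from state 3 yields state 4, and an update in the take direction from state 7 leaves the state at 7.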

1. When an Actual Thread is Retired or Reset Without Retirement

When an actual thread is retired, that is evidence that the
thread would retire again if it were created again in response
to the same spawning opportunity instruction, particularly if
it has the same thread lineage. The thread lineage of a thread
of interest includes a parent thread and perhaps a grandparent
thread. The parent thread is the thread containing the
spawning opportunity instruction. The grandparent thread is
the parent of the parent. The thread lineage is represented as
address bits of the thread of interest, parent, and perhaps
grandparent. However, not all the bits need be used (e.g.,
only part of the parent and grandparent bits might be used)
and the bits that are used may be acted on by some function
to help avoid aliasing with respect to the thread predictor.

In response to an actual thread retiring, a signal is
provided to thread predictor 54 to update the state machine
(e.g., incrementing the state of the state machine) to indicate
a successfully taken thread creation opportunity. However,
in general, there is no guarantee the thread would retire
again if it were executed again. If a thread is reset, that is
evidence that a thread would be reset again if it were created
again, particularly if it has the same lineage. Accordingly, in
response to an actual thread being reset, a signal is provided
to thread predictor 54 to update the state machine (e.g.,
decrementing the state) to indicate an unsuccessfully taken
thread creation opportunity. However, in general, there is no
guarantee the thread would be reset if executed again.

Referring to FIG. 7, in one embodiment, thread predictor 
54 includes a correlated predictor 62 and a simple predictor 
64 (although in other embodiments only a correlated
predictor or only a simple predictor might be used). The
following provides details of specific examples of correlated 
and simple predictors. However, the invention may be 
implemented with different details. 

Correlated predictor 62 and/or simple predictor 64 are
updated in response to an actual thread retiring or resetting
and are updated in response to a potential thread meeting or
not meeting a test of thread goodness. Correlated predictor
62 and/or simple predictor 64 are read in response to a thread
creation opportunity being detected. Referring to FIG. 10, a
correlated predictor 62 includes a thread history table 512
including 2^J count registers 516-1, 516-2, . . . 516-2^J, which
may be part of finite state machines (e.g., up/down saturating
counters). When a thread is retired or reset, a signal is
provided on conductors 142 (as shown in FIG. 7) to update
the state of correlated predictor 62 by updating the count in
the count register that corresponds to the actual thread that
is retired or reset. A read-modify-write approach may be
used to update the value of the appropriate count register.
Logic in thread management logic 124 controls updating and
reading of states. When a spawning opportunity instruction
is detected, thread management logic 124 reads the
corresponding count register to determine whether or not to take
the thread creation opportunity based at least in part on the
state of the correlated predictor. For example, if the count in the
count register has one relationship (e.g., > or ≥) to a
threshold, correlated predictor 62 is in a "take state" and the
actual thread is created. If the count has another relationship
(e.g., < or ≤) to the threshold, correlated predictor 62 is in a
"not take state" and an actual thread is not created. In one
embodiment, even if the state is a "take state," thread
management logic 124 will not cause a thread to be created
if it means resetting a thread earlier in program order. In
another embodiment, the thread will be created even if it
means resetting a thread earlier in program order.

In one embodiment, a particular one of the count registers
516 of thread history table 512 is indexed through a signal
on conductors 72, which is the result of an exclusive-OR
(XOR) function 508. One input to XOR function 508 is the
J least significant bits (LSBs) of an address on conductors 66
representing an actual thread (for updating a count register
following retirement or reset), a potential thread (for updating
a count register following a determination of whether a
potential thread meets the test), or thread creation opportunity
(for reading a count register), depending on the situation.
The address provided to XOR function 508 does not
necessarily have to be the first address of an actual thread,
retired thread, or thread creation opportunity.

Another input is a thread lineage from history register
504. The bits in history register 504 may include some
function of bits of the parent and grandparent of actual
thread, potential thread, or thread creation opportunity. For
example, the function may be the X LSBs from the first
address of the parent and fewer LSBs from the first address
of the grandparent thread of the thread creation opportunity,
either concatenated or combined through, for example, an
XOR function. In the case of a retired or reset actual thread,
the thread lineage value for history register 504 may come
from the tree structure. Each time a new thread is added to
the tree structure, a thread lineage field associated with the
tree structure can hold the thread lineage value associated
with the thread. Then when the thread leaves the tree
structure, the thread lineage value can be written into history
register 504. In the case of a thread creation opportunity, the
thread lineage value to be written into history register 504
can be determined from the address of the thread creation
opportunity and the parent and perhaps grandparent in the
tree structure. In the case of a potential thread, the thread
lineage value to be written into history register 504 can be
copied from a stack described below in connection with
potential threads.

XOR function 508 is used to reduce the chances of
aliasing in which the LSBs of the addresses of two thread
creation opportunities and their parent and grandparent
threads have similar values. An XOR function 508 does not
have to be used. For example, the address of the thread
creation opportunity could index a row and history register
504 could index a column of count registers. Various
arrangements may be used.
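
As a non-limiting illustration, the indexing of a table of count registers through an XOR of address bits and lineage bits may be sketched as follows. The value of `J`, the particular lineage function, the threshold, the initial counts, and all names below are assumptions chosen for the sketch and are not taken from the specification.

```python
J = 4  # assumed number of index bits, giving 2**J count registers


def lineage_bits(parent_addr, grandparent_addr, j=J):
    """One possible lineage function: parent LSBs XORed with fewer grandparent LSBs."""
    parent_mask = (1 << j) - 1
    grandparent_mask = (1 << (j // 2)) - 1
    return (parent_addr & parent_mask) ^ (grandparent_addr & grandparent_mask)


def table_index(addr, history, j=J):
    """XOR the J least significant address bits with the lineage (history) bits."""
    mask = (1 << j) - 1
    return (addr & mask) ^ (history & mask)


class CorrelatedPredictor:
    """Table of up/down saturating counters indexed by address XOR lineage."""

    def __init__(self, j=J, threshold=4, max_count=7):
        self.j = j
        self.threshold = threshold
        self.max_count = max_count
        # Assumed initial value: just below the threshold (a not take state).
        self.counts = [threshold - 1] * (1 << j)

    def update(self, addr, history, take):
        """Read-modify-write of the count register selected by the index."""
        i = table_index(addr, history, self.j)
        if take:
            self.counts[i] = min(self.max_count, self.counts[i] + 1)
        else:
            self.counts[i] = max(0, self.counts[i] - 1)

    def should_take(self, addr, history):
        """Take state if the selected count is at or above the threshold."""
        return self.counts[table_index(addr, history, self.j)] >= self.threshold
```

In this sketch, the same address with two different lineages selects two different count registers, so training one lineage does not change the prediction for the other.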

A reason for using bits from the parent and grandparent
threads is that there may be a correlation between the lineage
of a thread creation opportunity and whether the thread
creation opportunity will be retired if it is spawned.

Referring to FIG. 11, as an example, simple predictor 64
includes a thread history table 522 which includes 2^K count
registers. (K may be equal to or different than J.) An address
representing the actual thread, potential thread, or thread
opportunity is received on conductors 76. The count registers
are updated and read as described above in connection
with the correlated predictor. Initially, thread management
logic 124 might only consider the counts of simple predictor
64. Logic in thread management logic 124 controls updating
and reading of simple predictor 64 and decides whether to
create a thread based at least in part on the state of simple
predictor 64 (e.g., the counts in the appropriate count
registers). Then, as counts of correlated predictor 62 become
meaningful, the counts of simple predictor 64 might be
ignored.
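
As a non-limiting illustration, the choice between the simple predictor and the correlated predictor may be sketched as follows. The specification does not define when the correlated counts "become meaningful"; the sketch assumes, purely for illustration, that a correlated count is meaningful once its register has been updated a hypothetical `warmup` number of times. The function name, parameters, and threshold are likewise assumptions.

```python
def choose_prediction(simple_count, correlated_count, correlated_updates,
                      threshold=4, warmup=2):
    """Return True (take) or False (not take).

    Uses the correlated count once it has been trained at least `warmup`
    times (an assumed notion of "meaningful"); otherwise falls back to
    the simple count.
    """
    if correlated_updates >= warmup:
        return correlated_count >= threshold
    return simple_count >= threshold
```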

2. When a Potential Thread Does or Does Not Meet a Test of Thread Goodness

When an actual thread is executed, it will be known
whether the thread is retired or reset without retirement.
However, when a thread creation opportunity is not taken,
the retirement or resetting of the non-created thread cannot
be observed. Nonetheless, predictor training mechanism 56
analyzes retired instructions to make a prediction of whether
it would be a good use of processor resources to create an
actual thread from a potential thread.

Referring to FIG. 7, in one embodiment, final retirement
logic 134 includes predictor training mechanism 56 which
provides signals indicating whether potential threads meet a
test of thread goodness. If the test of thread goodness is met,
correlated predictor 62 and/or simple predictor 64 of thread
predictor 54 are updated in a take direction (e.g., the count
is incremented unless it is already a maximum). If the test is
not met, correlated predictor 62 and/or simple predictor 64
are updated in a not take direction (e.g., the count is
decremented unless it is a minimum). Referring to FIGS. 10
and 11, in one embodiment, the address of a thread spawning
opportunity instruction of the potential thread is used to
index thread history tables 512 and 522. In the case of
correlated predictor 62 in FIG. 10, the potential thread
address is applied to an input to XOR function 508. History
register 504 holds bits related to the lineage of the potential
threads.

In one embodiment, a potential thread is not analyzed if
its spawning opportunity instruction is the spawning
opportunity instruction of an actual thread that is retiring or has
just retired. In another embodiment, the potential thread is
analyzed even if its spawning opportunity instruction is the
spawning opportunity instruction of an actual thread that is
retiring or has recently retired. In such a case, thread
predictor 54 might only be updated in response to either (1)
retirement or resetting of the actual thread or (2) the potential
thread meeting or not meeting the test of thread
goodness, but not both (1) and (2). In another embodiment, thread
predictor 54 might be updated twice in response to both (1)
and (2).

Predictor training mechanism 56 includes a retired thread
monitor 82 to analyze retired instructions and to identify
potential threads in the retired instructions in the same way
that thread management logic 124 identifies thread creation
opportunities in instructions. Predictor training mechanism
56 determines whether the potential threads meet a test of
thread goodness.

In one embodiment, the test of thread goodness is met if
certain criteria of thread goodness are met. Examples of such
criteria include criteria (1), (2), and (3), described below. In
another embodiment, there is only one criterion of thread
goodness (e.g., criterion (1)) and the test is met if the
criterion is met. In yet another embodiment, there are two
criteria (which may include one or more of criteria (1), (2),
and (3)), and the test is met if the two criteria are met. There
may be more than three criteria. In still another embodiment,
at least one but fewer than all of multiple criteria must be
met for the test to be met. The test may change depending
on circumstances.
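
As a non-limiting illustration, the different ways of combining criteria into a test of thread goodness may be sketched as a single function. The function name and the `required` parameter are hypothetical; by default the sketch requires every supplied criterion to be met, and a smaller `required` value models an embodiment in which fewer than all criteria must be met.

```python
def meets_goodness_test(criteria_results, required=None):
    """Return True if enough criteria of thread goodness are met.

    criteria_results: iterable of booleans, one per criterion evaluated.
    required: how many must be met; None means all of them.
    """
    results = list(criteria_results)
    needed = len(results) if required is None else required
    return sum(results) >= needed
```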

FIG. 12 illustrates an example with which criteria (1), (2),
and (3) may be described. FIG. 12 illustrates potential
threads T0, T1, T2, T3, and T4 which include nested
functions or procedures initiated by calls call 1, call 2, call
3, and call 4. It is assumed there are only four trace buffers
114A, 114B, 114C, and 114D in processor 50. The program
and retirement order is T0, T4, T3, T2, and T1. Without the
predictor of the invention, the time order would be T0, T1,
T2, T3 and then T1 would be reset to make room for T4.
However, with the invention, it can be predicted that T1
would be reset (because it fails criterion (1) discussed
below) so that if each of threads T0-T4 is actually created
and the T1 thread creation opportunity instruction is
predicted "not taken" in thread predictor 54, the time order
would be T0, T2, T3, and T4. Referring to FIG. 13, in the
example, thread T1 would not be created, but rather would
be included as part of thread T2.

A retired instruction counter 86 is incremented as instructions
are retired. Table 1, below, lists the spawning opportunity
instruction and joining opportunity instruction of
threads T1-T4 and the count of instruction counter 86 at the
instruction, where C1<C2<C3 . . . <C8. In one embodiment,
the joining opportunity instruction is the instruction of the
previous potential thread in program order that joins with the
potential thread of interest. For example, return 1 is the
joining instruction of potential thread T4.

TABLE 1

          Spawning                Joining
          opportunity             opportunity
Thread    instruction    Count    instruction    Count

T1        call 1         C1       return 4       C8
T2        call 2         C2       return 3       C7
T3        call 3         C3       return 2       C6
T4        call 4         C4       return 1       C5

Criterion (1) Number of potential threads between spawn
and join (stack depth). Criterion (1) concerns whether it
would be likely that a potential thread, which is actually
spawned, would be reset before retiring to make room for a
thread earlier in program order. If so, the potential thread
does not meet criterion (1). Referring to FIGS. 7 and
14A-14F, a stack 84 may be used to keep track of a potential
thread distance count. (Stack 84 is illustrated with only four
entries, but may have more.) In one embodiment, stack 84
includes fields for a program counter (PC) value of the first
address of the potential thread of interest; an indicia of the
thread (e.g., T1); a distance count (dist cnt) which represents
the number of intervening potential threads (note that the
value changes as additional spawning opportunity instructions
are received); a spawn count, which is the count of
retired instruction counter 86 associated with the spawning
opportunity instruction; and a join count, which is the count
of retired instruction counter 86 associated with a joining
opportunity instruction. Stack 84 may include other fields
such as whether the spawning opportunity instruction is a call,
backward branch, or special compiler instruction. A temporary
register 144 holds the information regarding the potential
thread most recently popped off stack 84. The thread
lineage can be determined from the PCs of the potential
thread of interest and its parent and perhaps grandparent or
stack 84 may include a field (as part of, in addition to, or in
place of the PC) to hold the thread lineage.

Referring to FIGS. 12 and 14A, as the call 1 instruction
is detected, information (e.g., PC value, thread indicia, and
spawn count) regarding thread T1 is pushed onto stack 84.
A "1" is placed in a distance count (dist cnt) field.
Referring to FIG. 14B, as the call 2 instruction is detected,
information regarding thread T2 is pushed onto the stack and
the information regarding thread T1 is pushed deeper into
stack 84. A "1" is placed in the distance count field of thread
T2 and the distance count value of thread T1 is incremented
to a "2". Referring to FIGS. 14C and 14D, as the call 3 and
call 4 instructions are detected, information regarding
threads T3 and T4 is pushed onto the stack in order, and the
information regarding threads T1 and T2 is pushed deeper.
The distance count fields are incremented as new spawning
opportunity instructions are received.

When a potential thread is joined by a thread earlier in
program order, the information regarding the potential
thread is popped off the stack. As threads are popped off the
stack, the distance counts do not decrement, except as
described below. Referring to FIG. 14E, after the return 1
instruction is received, thread T0 joins thread T4 and the
information regarding thread T4 is popped off stack 84. The
distance count of thread T4 is "1", which is less than the
threshold of 4, so T4 meets criterion (1). Referring to FIG.
14F, after the return 2 instruction is received, thread T4 joins
thread T3 and the information regarding thread T3 is popped
off stack 84. The distance count of thread T3 is "2", which
is less than the threshold of 4, so thread T3 meets criterion
(1).

Note, however, the distance count of thread T1 is 4, which
is not less than the threshold of 4, so that thread T1 does not
meet criterion (1) of thread goodness. In essence, this is
because each of the four trace buffers is likely to be used
by threads T0, T2, T3, and T4 so that if thread T1 were
created, it would likely be reset to make room for one of
threads T2, T3, or T4, which would result in a performance
penalty.
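
As a non-limiting illustration, the distance count bookkeeping of criterion (1) may be sketched in software for the example of FIG. 12. The class and field names are hypothetical, the retired instruction counts are arbitrary placeholder values, and the sketch omits the optional decrementing of distance counts that is described separately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    thread: str
    spawn_count: int
    dist_cnt: int = 1               # a "1" is placed in the field on push
    join_count: Optional[int] = None


class PotentialThreadStack:
    """Tracks potential threads between a retired spawn and its join."""

    def __init__(self, threshold=4):
        self.threshold = threshold  # e.g., the number of trace buffers
        self.entries = []

    def on_spawn(self, thread, retired_count):
        # Every entry already on the stack gains one intervening potential thread.
        for entry in self.entries:
            entry.dist_cnt += 1
        self.entries.append(Entry(thread, retired_count))

    def on_join(self, retired_count):
        # Pop the top entry; criterion (1) is met if its distance count
        # is less than the threshold.
        entry = self.entries.pop()
        entry.join_count = retired_count
        return entry, entry.dist_cnt < self.threshold


# Calls 1-4 retire, then returns 1-4 retire.
stack = PotentialThreadStack(threshold=4)
for count, name in enumerate(["T1", "T2", "T3", "T4"], start=1):
    stack.on_spawn(name, count)

results = {}
for count in range(5, 9):
    popped, meets = stack.on_join(count)
    results[popped.thread] = (popped.dist_cnt, meets)
```

Under these assumptions, T4, T3, and T2 are popped with distance counts 1, 2, and 3 and meet criterion (1), while T1 is popped with a distance count of 4 and does not.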

Criterion (2) Number of instructions between spawn and
join (amount of overlapping execution). If the number of
instructions between the spawning opportunity instruction
of the potential thread of interest and the joining instruction
of the spawning potential thread (which contains the
spawning opportunity instruction) to the potential thread of interest is
relatively small, there will not be much concurrent execution
of the spawning thread and the potential thread of interest.
In such a case, there will not be much performance benefit
and it probably will not be worth spawning the potential
thread of interest. Rather, when it is executed, the instructions
of the potential thread of interest could be part of the
spawning potential thread when executed as an actual
thread. For example, referring to FIG. 12, if the number of
instructions between call 4 and the first instruction of thread
T4 is small, it probably would not be a good use of resources
to spawn thread T4. Rather, thread T4 could be part of thread
T0.



In one embodiment, the amount of overlap (or concurrent
execution) of actual threads is estimated by the difference of
the join count and the spawn count of a potential thread.
These values are provided in fields of stack 84 and the
difference is shown in Table 2, below:

TABLE 2

             Overlapping Execution of
             Spawning Thread and Potential    Size of Thread
             Thread of Interest (count of     (count of next joining
             joining opportunity              opportunity instruction
Potential    instruction minus count of       minus count of joining
Thread       spawning opportunity             opportunity instruction of
of Interest  instruction)                     potential thread of interest)

T1           C8 - C1                          Not known in the example
T2           C7 - C2                          C8 - C7
T3           C6 - C3                          C7 - C6
T4           C5 - C4                          C6 - C5

In one embodiment, the difference is compared to a
threshold value to determine if criterion (2) of thread goodness
is met. Criterion (2) is not met if the difference has a
first relationship (e.g., < or ≤) to the threshold value and is
met if the difference has a second relationship (e.g., > or ≥) to the
threshold value. One factor in determining the magnitude of
the threshold value is the number of cycles taken to create
and to retire or reset a thread.

Criterion (3) Number of instructions from join to next join
(size of potential thread of interest). If the number of
instructions of a potential thread of interest is relatively
small, it is probably not worth spawning it as an actual
thread. The size of the potential thread is compared to a
threshold value to determine if criterion (3) of thread goodness
is met. Criterion (3) is not met if the size has a first
relationship (e.g., < or ≤) to the threshold value and is met if
the size has a second relationship (e.g., > or ≥) to the threshold
value. One factor in determining the magnitude of the
threshold value is the number of cycles taken to create and
to retire or reset a thread. Note that the size is only an
estimate of the size of an actual thread.

In one embodiment, the size of the potential thread of
interest is the difference of the join count of the next thread
in program order and the join count of the potential thread
of interest. As mentioned, the join count is the count of
retired instruction counter 86 at the joining opportunity
instruction. The temporary register 144 is used because the
join count of the later potential thread is not known until
after the potential thread of interest is popped off stack
84 and placed in temporary register 144. For example, in
FIG. 14E, information regarding thread T4 is held in temporary
register 144. The size of thread T4 is C6-C5, where
C6 is in stack 84 and C5 is in the temporary register. Values
for the thread size are in Table 2, above.
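
As a non-limiting illustration, the overlap estimate of criterion (2) and the size estimate of criterion (3) may be computed from spawn counts and join counts as follows. The function name, the threshold values, and the numeric counts standing in for C1 through C8 are assumptions made for the sketch. The sketch uses a strict "greater than" comparison, which is only one of the relationships the text permits.

```python
def evaluate(pops, overlap_threshold, size_threshold):
    """Evaluate criteria (2) and (3) for potential threads.

    pops: list of (thread, spawn_count, join_count) in the order the
    threads are popped off the stack (i.e., program order of the joins).
    """
    out = {}
    for k, (thread, spawn, join) in enumerate(pops):
        # The size needs the join count of the next thread in program
        # order, which is only known once that thread is joined.
        next_join = pops[k + 1][2] if k + 1 < len(pops) else None
        overlap = join - spawn
        size = None if next_join is None else next_join - join
        out[thread] = {
            "overlap": overlap,
            "size": size,
            "criterion2": overlap > overlap_threshold,
            "criterion3": None if size is None else size > size_threshold,
        }
    return out


# Placeholder counts: C1..C4 = 10, 20, 30, 40 and C5..C8 = 45, 90, 150, 300.
pops = [("T4", 40, 45), ("T3", 30, 90), ("T2", 20, 150), ("T1", 10, 300)]
report = evaluate(pops, overlap_threshold=10, size_threshold=10)
```

With these placeholder counts, T4 has an overlap of 5 (C5 - C4) and fails criterion (2), while its size of 45 (C6 - C5) meets criterion (3). The size of T1 is not known, as in Table 2.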

Referring to FIGS. 14D and 14E, as illustrated, the
distance count values do not decrement when the thread at
the top of the stack is joined. However, in one embodiment,
if the thread does not meet criteria (2) and (3), the distance
count values of the potential threads are decremented.

In a more sophisticated and complicated mechanism, for
criteria (2) and (3), retired thread monitor 82 could consider
the type of instructions as well as the number of instructions.

In one embodiment, when a state machine is updated in
response to retirement or resetting of an actual thread, the
state is changed by two states (e.g., state 3 to state 5, state
6 to state 4, state 0 to state 0), but when the state machine
is updated in response to a potential thread meeting or not
meeting the test, it is only changed by one state (e.g., state
3 to state 4, state 7 to state 7).

Although the example of FIG. 12 concerns functions, the
criteria of thread goodness may also be determined for
threads having a backward branch spawning opportunity
instruction. For calls, a join may be determined by observing
the instruction statically following the spawning opportunity
instruction. For loops, it may be determined by observing
the loop's exit (where the current retired instruction is
between the bounds of the start and end of the loop and the
next retired instruction is outside these bounds and the next
retired instruction is not a call).
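
As a non-limiting illustration, the loop exit condition stated above may be expressed as a predicate over two consecutive retired instructions. The function and parameter names are hypothetical, the loop bounds are treated as inclusive addresses by assumption, and how the caller determines that the next retired instruction "is a call" is left outside the sketch.

```python
def is_loop_exit(current_pc, next_pc, loop_start, loop_end, next_is_call):
    """Return True if retiring next_pc after current_pc marks a loop exit.

    The current retired instruction must lie within the loop bounds, the
    next retired instruction must lie outside them, and the next retired
    instruction must not be a call.
    """
    current_inside = loop_start <= current_pc <= loop_end
    next_outside = not (loop_start <= next_pc <= loop_end)
    return current_inside and next_outside and not next_is_call
```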
C. Additional Information and Embodiments 

Referring to FIG. 1, in one embodiment, processor 50
may include a branch prediction unit 60 as part of a fetch
unit that includes decoder 106 and program counters (PCs)
112A, 112B, . . . , 112X. Branch prediction unit 60 may
include a branch predictor having a correlated predictor
and/or a simple predictor. If a branch is correctly predicted
taken, the predictor is incremented and if a branch is
incorrectly predicted taken, the corresponding count register
is decremented. The correlated predictor may contain a
branch history shift register that is shifted with each branch.

In one embodiment, there is a return address stack for
each thread, to provide a return address from functions.
More than one return address stack may be used because in
a processor that executes threads out of order, the return
from a function in time order may occur before an earlier
call in the program. Of course in program order, the call
statement comes first. The contents of the return address
stack of the spawning thread at the time of the spawn are
copied into the return address stack of the spawned thread
(i.e., the thread created in response to the spawning
opportunity instruction in the spawning thread). Following the
copy, return addresses are pushed and popped on and off
of the return address stacks of the spawned and spawning
threads independently.
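
As a non-limiting illustration, the copying of a return address stack at the time of a spawn may be sketched as follows. The class and method names are hypothetical and the addresses are arbitrary.

```python
class ThreadContext:
    """Per-thread return address stack."""

    def __init__(self, return_stack=None):
        # list(...) makes an independent copy of any stack passed in.
        self.return_stack = list(return_stack or [])

    def spawn(self):
        """Create a spawned thread whose stack is a copy taken at spawn time."""
        return ThreadContext(self.return_stack)

    def call(self, return_addr):
        self.return_stack.append(return_addr)

    def ret(self):
        return self.return_stack.pop()


parent = ThreadContext()
parent.call(0x100)
child = parent.spawn()   # child starts with a copy: [0x100]
parent.call(0x200)       # later pushes on the parent do not affect the child
```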

The circuits and details that are described and illustrated
are only exemplary. Various other circuits and details could
be used in their place. Further, there may be various design
tradeoffs in size, latency, etc. There may be intermediate
structure (such as a buffer) or signals between two illustrated
structures. Some conductors may not be continuous as
illustrated, but rather be broken up by intermediate structure.
The borders of the boxes in the figures are for illustrative
purposes. An actual device would not have to include
components with such defined boundaries. For example, a
portion of the final retirement logic could be included in the
boxes labelled trace buffers. The relative size of the
illustrated components is not to suggest actual relative sizes.
Arrows show certain data flow in certain embodiments, but
not every signal, such as data requests. Where a logical high
signal is described above, it could be replaced by a logical
low signal and vice versa.

The instructions within a thread may pass from decoder
106 to trace buffers 114 and execution unit 108 entirely in
program order or in something other than program order.

The terms "connected," "coupled," and related terms are
not limited to a direct connection or a direct coupling, but
may include indirect connection or indirect coupling. The
term "responsive" and related terms mean that one signal or
event is influenced to some extent by another signal or event,
but not necessarily completely or directly. If the specification
states a component "may," "could," or "might" be
included, that particular component is not required to be
included.

Those skilled in the art having the benefit of this disclosure
will appreciate that many other variations from the
foregoing description and drawings may be made within the
scope of the present invention. Accordingly, it is the following
claims including any amendments thereto that define
the scope of the invention.
What is claimed is: 

1. A processor comprising: 

thread management logic including a thread predictor
having state machines to indicate whether thread creation
opportunities should be taken or not taken; and

a predictor training mechanism to receive retired instructions
and to identify potential threads from the retired
instructions and to determine whether a potential thread
of interest meets a test of thread goodness, and if the
test is met, one of the state machines that is associated
with the potential thread of interest is updated in a take
direction, and if the test is not met, the state machine is
updated in a not take direction.

2. The processor of claim 1, wherein the predictor training 
mechanism determines whether the potential thread of inter- 
est meets one criterion of thread goodness and the test is met 
if the criterion of thread goodness is met. 

3. The processor of claim 1, wherein the predictor training 
mechanism determines whether the potential thread of inter- 
est meets at least one criterion of thread goodness of certain 
criteria of thread goodness, and the test is met if the at least 
one criterion of thread goodness is met. 

4. The processor of claim 1, wherein the predictor training
mechanism determines whether the potential thread of inter- 
est meets certain criteria of thread goodness and the test is 
met if the criteria of thread goodness are met. 

5. The processor of claim 1, wherein the test involves 
determining whether a criterion of thread goodness is met, 
which criterion involves making an estimate as to whether 
the potential thread of interest would be reset to make room
for another thread earlier in program order if the potential 
thread of interest were created as an actual thread. 

6. The processor of claim 1, wherein the test involves
determining whether a criterion of thread goodness is met, 
which criterion involves determining a number of retired 
instructions between a spawning opportunity instruction and 
a joining opportunity instruction of the potential thread of 
interest, and comparing the number to a threshold value.

7. The processor of claim 1, wherein the test involves 
determining whether a criterion of thread goodness is met, 
which criterion involves determining a number of instruc- 
tions in the potential thread of interest and comparing the 
number to a threshold value. 

8. The processor of claim 1, wherein the predictor training 
mechanism includes a stack onto which information repre- 
senting potential threads is pushed when a retired spawn 
opportunity instruction is detected and from which informa- 
tion is popped off when a retired joining opportunity instruc- 
tion is detected and the stack includes a distance count 
related to a number of potential threads meeting a criterion 
of thread goodness between a retired spawning opportunity 
instruction and joining opportunity instruction of the poten- 
tial thread of interest. 

9. The processor of claim 8, wherein the predictor training 
mechanism includes a retired instruction counter that provides
a count representing the retired instructions and wherein
the stack further includes fields holding count values for 
spawning opportunity instructions and joining opportunity 
instructions. 

10. A processor comprising: 

thread management logic to control creation of an actual
thread, the thread management logic including reset
logic to control whether the actual thread is reset and a
thread predictor having state machines to indicate
whether thread creation opportunities should be taken
or not taken, and wherein if the actual thread is reset,
one of the state machines associated with the actual
thread is updated in a not take direction; and

final retirement logic to control whether the actual thread
is retired, and wherein if the actual thread is retired, the
state machine associated with the actual thread is
updated in a take direction, the final retirement logic
including a predictor training mechanism to receive
retired instructions and to identify potential threads
from the retired instructions and to determine whether
a potential thread of interest meets a test of thread
goodness, and if the test is met, one of the state
machines that is associated with the potential thread of
interest is updated in a take direction, and if the test is
not met, the state machine associated with the potential
thread of interest is updated in a not take direction.

11. The processor of claim 10, wherein the state machine
associated with the actual thread is the same state machine
associated with the potential thread of interest if the actual
thread and potential thread of interest have the same index
into the state machine.

12. The processor of claim 11, wherein the index of the 
actual thread includes bits of an address of the actual thread 
and the index of the potential thread includes bits of an 
address of the potential thread. 

13. The processor of claim 12, wherein more than one
state machine is associated with the actual thread and each
of the state machines that is associated with the actual thread
is updated if the actual thread is reset or retired and wherein
more than one state machine is associated with the potential
thread and each of the state machines that is associated with
the potential thread is updated if the potential thread is or is
not a good potential thread.

14. The processor of claim 13, wherein the state machines
include simple and correlated predictors and wherein a
simple predictor is indexed through at least some of the
address bits of the actual thread, thread opportunity, or
potential thread of interest and a correlated predictor is
indexed through some of the address bits of the actual
thread, thread opportunity, or potential thread of interest and
bits related to the thread lineage of the actual thread, thread
opportunity, or potential thread of interest processed through
a function.

15. The processor of claim 14, wherein the function is an
exclusive-OR function.

16. The processor of claim 14, wherein the simple predictor
is not used after the thread lineage is established.

17. The processor of claim 14, wherein if the potential
thread of interest has the same index as a recently retired or
reset actual thread, the final retirement logic does not update
the associated state machine in response to whether or not
the potential thread of interest is a good potential thread.

18. The processor of claim 14, wherein the state machine
associated with an actual or potential thread may be updated
twice, once for the potential thread and once for a recently
retired actual thread with the same index.

19. The processor of claim 10, wherein the state machines
are indexed through at least some of the address bits of the
thread opportunities.

20. A processor comprising: 

an execution pipeline to concurrently execute at least
portions of actual threads;
detection circuitry to detect speculation errors involving
thread dependencies in the execution of the actual
threads;
trace buffers outside the execution pipeline to hold
instructions of the actual threads;
triggering logic to trigger re-execution of instructions
from the trace buffers associated with the speculation
errors;
thread management logic to control creation of an actual
thread, the thread management logic including reset
logic to control whether the actual thread is reset and a
thread predictor having state machines to indicate
whether thread creation opportunities should be taken
or not taken, and wherein if the actual thread is reset,
one of the state machines associated with the actual
thread is updated in a not take direction; and
final retirement logic to control whether the actual thread
is retired, and wherein if the actual thread is retired, the
state machine associated with the actual thread is
updated in a take direction.

21. The processor of claim 20, further comprising:
a predictor training mechanism to receive retired instructions
and to identify potential threads from the retired
instructions and to determine whether a potential thread
of interest meets a test of thread goodness, and if the
test is met, one of the state machines that is associated
with the potential thread of interest is updated in a take
direction, and if the test is not met, the state machine
associated with the potential thread of interest is
updated in a not take direction.

22. The processor of claim 20, wherein more than one
state machine is associated with the actual thread and each
of the state machines that is associated with the actual thread
is updated if the actual thread is reset or retired.
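
The predictor updates recited in claims 10 through 22 can be illustrated with a minimal sketch. It assumes that each state machine is a 2-bit saturating counter and that the correlated index is formed by XORing address bits with lineage bits, in the manner of the exclusive-OR function of claim 15. The table size, the counter width, and every name below are illustrative assumptions; the claims do not specify them.

```python
TABLE_BITS = 4                       # assumed table size: 16 state machines
MASK = (1 << TABLE_BITS) - 1


class ThreadPredictor:
    """Hypothetical predictor: one 2-bit saturating counter per index."""

    def __init__(self):
        # Counters range 0..3; start at 1, i.e. weakly "not take".
        self.counters = [1] * (1 << TABLE_BITS)

    def simple_index(self, address):
        # Simple predictor: indexed through some of the address bits.
        return address & MASK

    def correlated_index(self, address, lineage):
        # Correlated predictor: address bits XORed with lineage bits.
        return (address ^ lineage) & MASK

    def update(self, index, take):
        # Move the state machine one step in the take or not take direction.
        c = self.counters[index]
        self.counters[index] = min(c + 1, 3) if take else max(c - 1, 0)

    def should_take(self, index):
        return self.counters[index] >= 2
```

Under these assumptions, a retired thread or a potential thread that passes the goodness test would call `update(index, True)`, and a reset thread or a failed test would call `update(index, False)`.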
ABSTRACT



A method and apparatus for using multiple program 
counters to reduce the latency time of a computer in 
response to an interrupt or subroutine call using a mem- 
ory with multiple memory locations for storing the 
multiple program counters and control means in order 
to choose which one of the memory locations is used as 
a current program counter. Additionally, the use of a 
memory location to store the starting address of the 
interrupt subroutine is also disclosed. 
MINIMAL INTERRUPT LATENCY SCHEME
USING MULTIPLE PROGRAM COUNTERS

BACKGROUND OF THE INVENTION

This application concerns the method and apparatus
for reducing the interrupt latency time.

By common design practice, a microprocessor central
processing unit (cpu) samples the presence of an
interrupt request close to the end of the last transaction
of the instruction being executed. The cpu saves the
interrupted program's next instruction address, currently
residing in its program counter (pc), and then
loads the starting address of the interrupt service routine
into the pc. The new pc value then goes to the
address bus, so that the first instruction of the service
routine can be fetched for execution.

The address of the interrupted program generally is stored in
a memory stack, which is a block of memory utilized as a
first-in-last-out buffer, or by storing the address in a
temporary register.

The sampling of the presence of an interrupt request,
the saving of the current pc value, the loading of the
service routine pc value, up to the start of the first fetch
transaction for the service routine, require a certain
amount of time. This time, sometimes expressed as a
number of cpu clock periods, is defined as the interrupt
latency.

In systems where multiple interrupt request sources
are present, priority resolution has to be performed so
that the cpu can determine which interrupt request to
service. Depending on the system architecture, this
priority resolution may or may not add to the interrupt
latency.

At the end of an interrupt service routine, the saved
program address is reloaded into the pc, so that the
interrupted program can resume its execution.

A microprocessor manufactured by Zilog, Inc., the
Z80180, can be cited as a reference example. The
Z80180 samples its interrupt request 0 input at the cpu
clock's falling edge that is 1.5 clock periods earlier than
the end of the last transaction of an instruction being
executed. In interrupt mode 1, if the request input is
active, the Z80180 would next perform an interrupt
acknowledge transaction, followed by pushing its current
pc value to memory stack, done with two memory
write transactions, before the first instruction of the
interrupt service routine is fetched from a fixed memory
location of hexadecimal 0038. Counting from the above
mentioned cpu clock's falling edge, the interrupt latency
is 12.5 clock periods.

At the end of the interrupt service routine, the
Z80180 reloads its saved pc value with two memory
read transactions, accomplished in six clock periods.
That does not include the time to fetch the return from
interrupt instruction that invokes the pc value recovery.

Additionally, in prior computer systems on a response
to a subroutine call, the pc value and other values
are stored in a stack in a manner similar to the interrupt
service as described above. This also creates a time
delay in the computer processing.

It is an object of the present invention to have an
interrupt system with a reduced latency time for
responding to an interrupt.

An additional object of the present invention is to
reduce the latency time of a computer in response to a
subroutine call.

SUMMARY OF THE INVENTION

In accordance with the principles of the present invention,
the above and other objectives are realized by
utilizing a program counter apparatus for reducing the
latency time of a computer in response to an interrupt or
subroutine call, where the apparatus has a memory with
a plurality of memory locations which can correspond
to program counters, where one of the plurality of
memory locations corresponds to the general program
counter and others of the plurality of memory locations
correspond to alternate program counters, and where
control means connected to the memory control which
one of the memory locations is used by the computer as
the current program counter.

Additionally, another aspect of the invention is directed
to a method for reducing the latency time of a
computer in response to interrupts from interrupt devices.
This method uses a computer having a memory
with memory locations, where the method has the steps
of establishing a memory location in the memory
corresponding to the general program counter and memory
locations in the memory corresponding to alternate
program counters, prioritizing the interrupts received
from interrupt devices to choose an interrupt device
and associated subroutine to be serviced, switching the
memory location used in the computer as a program
counter from the memory location corresponding to the
general program counter to the memory location
corresponding to the alternate program counter associated
with the chosen interrupt device, executing an interrupt
subroutine, and then switching the memory
location used by the computer as the program counter
back from the memory location corresponding to the
alternate program counter associated with the chosen
interrupt device to the memory location corresponding
to the general program counter.

The use of multiple program counters reduces the
interrupt latency time because the old program counter
value does not have to be stored into a stack and then
retrieved from a stack, but the computer can merely
switch from using one program counter, for example,
the general program counter, to another program
counter, for example, an alternate program counter
associated with an interrupt device. When the subroutine
of the interrupt device is finished, the computer can
switch from the program counter associated with the
interrupt device back to the general program counter.

BRIEF DESCRIPTION OF THE DRAWINGS

Features of the present invention will become more
apparent upon reading the following detailed description
in conjunction with the drawings, in which:

FIG. 1 is a schematic diagram showing the control
means (control logic unit and the incrementor), the
multiple program counter memory, the interrupt devices,
and the instruction decode/execute unit;

FIG. 2 is a schematic diagram of the control logic
unit;

FIG. 3A is a timing diagram for the apparatus shown
in FIGS. 1 and 2;

FIG. 3B is a continuation of the timing diagram in
FIG. 3A;
FIG. 3C is a continuation of the timing diagram in
FIG. 3B; and

FIG. 4 is a diagram showing the steps of switching 
between the program counters. 

DETAILED DESCRIPTION 

FIG. 1 is a schematic diagram of one embodiment of
the present invention. Instead of using a traditional
program counter, which is a reloadable upcounter, multiple
program counters are stored in a memory 2. The
memory 2 has memory locations corresponding to multiple
program counters: the general program counter
13 and alternate program counters 15, 17, 19.
PC_INTC 15 is an alternate program counter associated
with the interrupt device C 5. PC_INTB 17 is an
alternate program counter corresponding to the interrupt
device B 7. PC_INTA 19 is an alternate program
counter corresponding to the interrupt device A 9. The
apparatus of the present invention could, of course,
have more alternate program counters than interrupt
devices, so that the apparatus would be able to add
additional interrupt devices. In addition, the present
invention could be used with as many interrupt devices
as the system designer desires. In one embodiment of
the present invention, the memory is a dual port RAM
(random access memory).

STRTC 110, STRTB 112 and STRTA 114 are memory
locations in the memory 2 that hold the starting
addresses of the interrupt subroutines that service
the associated interrupt devices.
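
The layout of the memory 2 can be pictured as a small table of words. The sketch below is illustrative only: it uses the word numbering that the description gives for the word select lines (word 0 for the general program counter, word 1 for PC_INTC, word 2 for PC_INTB, word 4 for STRTC, word 5 for STRTB). The assignments of word 3 to PC_INTA and word 6 to STRTA are inferred from that pattern, and the names `Word` and `make_memory` are hypothetical.

```python
from enum import IntEnum


class Word(IntEnum):
    """Word index of each location in memory 2 (assumed numbering)."""
    PC = 0        # general program counter 13
    PC_INTC = 1   # alternate program counter 15
    PC_INTB = 2   # alternate program counter 17
    PC_INTA = 3   # alternate program counter 19 (inferred)
    STRTC = 4     # starting address location 110
    STRTB = 5     # starting address location 112
    STRTA = 6     # starting address location 114 (inferred)


def make_memory(strt_a, strt_b, strt_c):
    """Build memory 2 with the service routine starting addresses loaded."""
    mem = [0] * len(Word)
    mem[Word.STRTA] = strt_a
    mem[Word.STRTB] = strt_b
    mem[Word.STRTC] = strt_c
    return mem
```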

A control means 120 controls the use of the program
counters in the memory 2. The control means 120 consists
of a control logic unit 6 and an incrementor 4. The
selection of the memory location in the memory 2 used
as the current program counter is performed by control
logic unit 6.

The control logic unit 6 also samples the interrupt
signals from the interrupt devices 5, 7, and 9. Interrupt
signal INTC from the interrupt device C 5 is sent to the
control logic unit 6 on line 25. Interrupt signal INTB is
sent to the control logic unit 6 over line 29 from interrupt
device B. Interrupt signal INTA is sent to the
control logic unit over line 33 by the interrupt device A.
These three interrupt signals have implicit priority
assignments corresponding to the priority of the interrupt
devices. For example, INTA has the highest priority,
INTB has the second highest priority, and INTC has
the lowest priority. Close to the end of the last transaction
of the instruction being executed, the control
logic unit captures the logic states of these asynchronous
inputs and decides which of the active interrupt
signals has the highest priority assignment. It then sets
one of three interrupt under service (ius) flags to reflect
this resolution. Once an ius flag is set, it inhibits any
lower priority interrupt signals from being recognized.
The interrupt under service flags can be output from the
cpu as status indicators, and if so desired, the outputs
can be properly timed to align their changes to the start
of the next transaction. The control logic unit 6 sends
the status indicator IUSA over line 35 to interrupt device
A, the status indicator IUSB over line 31 to interrupt
device B, and the status indicator IUSC over line
27 to interrupt device C. The interrupt devices can be
designed to recognize the active going edges of the
appropriate status indicators to deassert their interrupt
request signals.
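
The priority resolution just described can be sketched as a small function. This is an illustration, not the circuit of the control logic unit 6: the function name `resolve` and the use of sets of device letters are assumptions, and the sketch also assumes that an active ius flag inhibits a further request from its own device as well as from lower priority devices.

```python
PRIORITY = ("A", "B", "C")   # highest priority first, as in the example


def resolve(requests, ius):
    """Return the device whose ius flag should be set, or None.

    requests: set of devices with an active interrupt signal
    ius:      set of devices whose ius flag is already active
    """
    for dev in PRIORITY:
        if dev in ius:
            # An active flag inhibits requests at this level and below.
            return None
        if dev in requests:
            return dev
    return None
```

For example, `resolve({"B", "C"}, set())` selects device B, and `resolve({"C"}, {"B"})` selects nothing because IUSB inhibits the lower priority INTC.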

In addition, interrupt nesting is allowed. By common
design practice, many systems automatically disable
request input sampling when entering an interrupt service
subroutine. The service routine can then enable the
interrupt request sampling again. The enabling of the
interrupt request sampling is generally done by executing
an enable interrupt instruction in the interrupt service
subroutine. If an enable interrupt instruction is
executed in a first interrupt service subroutine and a
second interrupt request of higher priority occurs, it is
possible to have more than one ius flag in the active
state. This is because the corresponding active ius flag
does not inhibit request inputs from higher priority
interrupt devices from being sampled.

Generally, the memory location in memory 2 corresponding
to the highest priority ius flag in effect is read
from or written to through the port 1 111. That is, the
highest priority active ius flag dictates that the
corresponding alternate program counter be read onto the
internal address bus, and the incremented value written
back into the same memory location.

If the IUSC flag is the highest priority ius flag set,
then a read (RD1) from the control logic unit 6 to the
port 1 111 of memory 2 over line 61 would cause the
word contained in PC_INTC 15 to be sent across the
internal address bus 41. In the same step, the word from
PC_INTC 15 is sent to the incrementor 4. The incrementor
4 automatically stores the value sent up the
internal address bus 41 and increments this value so that
a write operation can use the incremented value. A
write operation (WR1) over line 63 from the control
logic unit 6 to the port 1 111 while the IUSC
flag is the highest priority ius flag set would cause the
incremented value from the incrementor 4 to be written
over bus line 43 into PC_INTC 15. PC_INTC 15, the
alternate program counter associated with interrupt
device C, is used as the current program counter while
IUSC is the highest priority ius flag set.

If no interrupt under service flag is set, then a RD1
from the control logic unit 6 across line 61 to port 1 111 of
the memory 2 will cause the contents of the general
program counter pc 13 to be sent across the internal
address bus 41 and to the incrementor 4. A WR1 signal
across line 63 from the control logic unit 6 to port 1 111
of the memory 2 will cause the incremented value from
the incrementor 4 to be placed into the memory location
corresponding to the general program counter pc 13.
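
The read, increment and write sequence described above can be summarized in a few lines. This is a minimal sketch under stated assumptions: the memory 2 is modeled as a Python list, the function name `fetch` is hypothetical, and the incrementor 4 is assumed to add one to the address.

```python
def fetch(mem, current_word):
    """One fetch on the selected word: RD1 followed by WR1.

    Returns the address placed on the internal address bus.
    """
    address = mem[current_word]       # RD1: selected word goes to the bus and incrementor
    mem[current_word] = address + 1   # WR1: incremented value goes back to the same word
    return address
```

Because the incremented value is written back to whichever word is currently selected, the general program counter and each alternate program counter advance independently of one another.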

A special reset highest priority active ius flag instruction
(RST_HIUS) is used as the last instruction in an interrupt
service routine. After this instruction is executed,
the memory location of the alternate program counter
corresponding to the next highest priority active ius flag
would then be used as the current program counter.

An exception to the word selection rule occurs to get
the first address of the interrupt subroutine. For the first
iteration, when the ius flag goes active, the corresponding
interrupt service routine starting address is read out
of memory location 110, 112 or 114 and the incremented
value is written back to the corresponding alternate
program counter memory location 15, 17 or 19.
In this way, the service routine starting address is
preserved for future use. For example, the first time an ius
flag such as IUSC is set, the starting address of the
interrupt subroutine associated with the interrupt
device C, stored in STRTC 110, is sent
up the internal address bus 41 and to the incrementor 4.
Then, when the write signal (WR1) is sent, the
incremented starting address value is sent across bus 43
to PC_INTC 15. Since the contents of the subroutine
starting address memory location for interrupt device C,
STRTC 110, are not written upon, the
subroutine starting address is saved.
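
The first fetch of a service routine differs from an ordinary fetch only in that the read and the write address different words. A minimal sketch, assuming the memory 2 is modeled as a list, that the incrementor adds one, and that the hypothetical helper `enter_service` receives the word indices of the starting address location and of the alternate program counter:

```python
def enter_service(mem, strt_word, pc_word):
    """First fetch of a service routine.

    Reads the starting address location, writes the incremented
    value to the alternate program counter, and returns the
    starting address placed on the internal address bus.
    """
    start = mem[strt_word]       # RD1 with the starting address word selected
    mem[pc_word] = start + 1     # WR1 with the alternate program counter word selected
    return start                 # the starting address word itself is never written
```

With the starting address of routine C held in word 4 and PC_INTC in word 1, `enter_service(mem, 4, 1)` leaves word 4 unchanged, so the same starting address is available the next time device C is serviced.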

Additionally, as discussed above, when none of the
ius flags are active, the general program counter 13 is
selected. Words 0 through 6, on lines 65, 67, 69, 71, 73,
75 and 77, each send a signal to a memory location in memory
2, which controls which memory location is to be read
from or written to. For example, the first time an ius
flag such as IUSC 8 is set, a signal is sent across word 4
on line 73 and RD1 on line 61 to cause the contents of
STRTC 110 to be read and sent across the internal
address bus 41 and to the incrementor 4. A signal is sent
across word 1, line 67, to PC_INTC 15 and across WR1, line
63, to cause the incremented value from the incrementor
4 to be written across the bus 43 into PC_INTC 15, the
memory location corresponding to the alternate program
counter associated with the interrupt device C. In
this manner, the word signals, the RD1 signal and the
WR1 signal control the reading and writing of the memory
locations in the memory 2.

Port 2 116 of the memory 2 is connected to the internal
data bus. Under program control, the interrupt service
routine starting addresses, which are stored in
STRTC 110, STRTB 112, and STRTA 114, can be
written through port 2 116. At certain instruction
executions, the contents of the memory locations of the
memory 2 can be modified. With properly designed
system timing, the two ports 111 and 116 of memory 2
should not have access conflicts. The reading and writing
through port 2 116 from the instruction decode/execute
unit 118 is done with a WR2 over line 57 and a
RD2 over line 59. For the balance of this discussion, it
is assumed that the interrupt service routines' starting
addresses have been written to STRTA, STRTB, and
STRTC.

A system reset on line 23 forces the general program 
counter pc 13 to 0 and the IUSA, IUSB and IUSC flags 
to inactive. 

Examples of interrupt devices used in the present
invention include peripherals such as a serial
communications controller (SCC), a counter/timer controller
(CTC), or a parallel input/output port (PIO).

FIG. 2 is a schematic diagram of the control logic
unit 6. The IUS SAMPLE block 82 in the control logic
unit 6 initiates the generation of IUS_SAMPLE pulses
across line 91 when an EN_INT signal is sent from the
instruction decode/execute unit 118, shown in FIG. 1,
across line 53. At the rising edge of an ius signal IUSC,
IUSB, or IUSA, the IUS SAMPLE block stops generating
IUS_SAMPLE pulses across line 91. Generation
of IUS_SAMPLE pulses across line 91 can resume
with the execution of another enable interrupt instruction,
which activates EN_INT across line 53 from the
instruction decode/execute unit 118 shown in FIG. 1.

At the rising edge of the IUS_SAMPLE signal
across line 91 to the priority resolution block 80, the
state of the external interrupt request INTC across line
25 from interrupt device C 5 shown in FIG. 1, INTB
across line 29 from interrupt device B 7 shown in FIG.
1, and INTA across line 33 from interrupt device A 9
are sampled. If an active interrupt request
has the highest priority of the interrupt requests or
is of a higher priority than any active ius flag, a signal is
sent to the corresponding ius block, by SETC line 103,
SETB line 101 or SETA line 99, to set the corresponding
ius flag.

Note that if IUS_SAMPLE is not active, none of the
SETC, SETB or SETA signals can go active.

A signal across the SETC line 103, SETB line 101, or
SETA line 99 goes to the corresponding ius set flag
block IUSC 8, IUSB 10, or IUSA 12, and the corresponding
ius flag goes active at the following edge of
IUS_SAMPLE.

When the instruction decode/execute unit 118 shown
in FIG. 1 executes a reset highest priority active ius flag
instruction, a RST_HIUS signal is sent across line 55.
The RESET HIUS block 84 sends a reset signal, RSTC
across line 97, RSTB across line 95 or RSTA across line
93, to reset the corresponding ius flag in block IUSC 8,
IUSB 10, or IUSA 12 on the reset signal's following
edge, based on which ius flag is active and has the
highest priority.

Each FETCH signal across line 51 from the instruction
decode/execute unit 118, shown in FIG. 1, causes
the RD/WR generator 86 to generate a RD1 pulse
across line 61. Each RD1 pulse across line 61 is followed
by a WR1 pulse across line 63, also generated by
the RD/WR generator 86. That is, one FETCH signal
causes both a RD1 and a WR1 pulse.

For each RD1 or WR1 pulse, the word select block
88 causes a word pulse to be sent to a memory location.
That is, word 0 over line 65, word 1 over line 67, word
2 over line 69, word 3 over line 71, word 4 over line 73,
word 5 over line 75 or word 6 over line 77 is selected by
the word select block 88. The word select block 88
samples the ius flags to determine which word to select.
As discussed above, the general rule to determine
which word is selected is to select the program
counter memory location associated with the highest
priority active ius flag. For example, if IUSB is active
and has the highest priority, the word select block selects
word 2 and the alternate program counter associated
with interrupt device B is used as the current program
counter. Two exceptions are word 0, which is
enabled to write to and read from the general program
counter memory location 13 when none of the ius flags
are active, and words 4, 5, and 6, which are set at the
rising edge of the ius flag to get the starting address of
the corresponding interrupt service routine. Word 4,
word 5 or word 6 enables the corresponding starting
address location STRTC 110, STRTB 112 or
STRTA 114, containing the subroutine starting address
of the interrupt device being serviced, to be output.
STRTC 110, STRTB 112 and STRTA 114 are accessed only
once at the beginning of each interrupt subroutine.
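
The word selection rule and its two exceptions can be expressed as a lookup. The sketch below is an illustration rather than the logic of the word select block 88: the function name `select_words` and the `just_set` argument, which marks the device whose ius flag has just gone active, are hypothetical, and the use of word 3 for PC_INTA and word 6 for STRTA is inferred from the numbering pattern in the description.

```python
PRIORITY = ("A", "B", "C")                  # highest priority first
PC_WORD = {"C": 1, "B": 2, "A": 3}          # alternate program counter words
STRT_WORD = {"C": 4, "B": 5, "A": 6}        # starting address words


def select_words(ius, just_set=None):
    """Return (read_word, write_word) for one fetch.

    ius:      set of devices whose ius flag is active
    just_set: device whose flag just went active, if any
    """
    active = next((d for d in PRIORITY if d in ius), None)
    if active is None:
        return 0, 0                          # general program counter
    write = PC_WORD[active]
    # On the first fetch after the flag is set, read the starting address.
    read = STRT_WORD[active] if just_set == active else write
    return read, write
```

With IUSB active, `select_words({"B"})` gives word 2 for both the read and the write, while `select_words({"B"}, just_set="B")` reads word 5 and writes word 2.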

FIGS. 3A, 3B and 3C show an example of an event
sequence. Looking at FIG. 3A, at time A,
the RD1 across line 61 and the word 0 across line 65 are
set so that the memory location of the general program
counter 13 of the memory 2, shown in FIG. 1, is read
across the internal address bus 41 and sent to the
incrementor 4. The contents of the internal address bus can
be used as the address of the instruction to be executed.
The contents of the internal address bus 41 is
m-1. This value is sent to the incrementor 4. It is then
incremented from m-1 to m. At time B, since the word
0 line 65 and the WR1 over line 63 are both set, the
incremented value from the incrementor 4, (m), is written
into the memory location 13 corresponding to the
general program counter.

At time C, since the RD1 and word 0 are set, the
contents of the memory location 13 (m) is sent across
the internal address bus and to the incrementor 4. Soon
afterward, an INTB signal from the interrupt device B
7 is sent to the control logic unit 6, and since the
IUS_SAMPLE signal is pulsing and no higher priority
ius flag is active, the IUSB flag is set.

Next, at time E, when the RD1 signal and the word 5
signal are set, the interrupt subroutine starting address
corresponding to interrupt device B, which is stored in
STRTB 112, is sent across the internal address bus and to
the incrementor 4. The incrementor 4 increments this
value (intb1) to the value intb2. At time F, word 2
across line 69 and the WR1 signal across line 63 are set,
so that the value intb2 is written to PC_INTB 17, the
memory location corresponding to the alternate program
counter associated with the interrupt device B. In
this manner, the starting location of the subroutine is
sent out across the internal address bus, incremented,
and stored into the alternate program counter
associated with the interrupt device B.

FIG. 4 is a diagram of the steps shown in FIGS. 3A,
3B and 3C. At time A, step 1 occurs; step 1 is a read
from the general program counter 13. The contents of
the general program counter are sent across the internal
address bus and to the incrementor 4. At time B, step 2
occurs, which is the writing of the incremented value to
general program counter 13. At time C, step 1 is repeated.
At time D, step 2 is repeated. When an interrupt
occurs, the memory location corresponding
to the general program counter therefore contains
the address of the next instruction to be executed after a
return from the interrupt.

Before time E, an INTB interrupt signal is sent by the
interrupt device B to the control logic unit 6. The control
logic unit 6 sets the IUSB flag at the down edge of
the IUS_SAMPLE signal. The interrupt device B is
now the chosen interrupt device.

At time E, step 3 occurs; step 3 is a read from the
memory location 112 which corresponds to the starting
address of the interrupt subroutine associated with the
interrupt device B. This starting address is sent across
the internal address bus and to the incrementor 4, which
increments the value. Next, at time F, step 4 occurs; step
4 is a write of the incremented value to PC_INTB 17,
the memory location corresponding to the alternate
program counter associated with the interrupt device B.
At time G, step 5 occurs, which is a read from the
PC_INTB 17, the contents of which are sent across the
internal address bus and to the incrementor 4.

Note that at time F and time G an IUS_SAMPLE
signal does not occur because the subroutine for the
interrupt device B has not enabled the sampling of the
other interrupt devices. When the interrupt subroutine
for the first chosen interrupt device (interrupt device B)
enables the sampling of the other interrupt devices, the
IUS_SAMPLE signal resumes at time H. A subroutine
corresponding to an interrupt device need not allow the
sampling of the other interrupt devices, and such a
subroutine would not be interrupted by other higher
priority interrupt devices.

At time H, step 4 repeats so that PC_INTB 17 now
contains intb3; when the next interrupt subroutine is
returned from, the alternate program counter corresponding
to the interrupt device B contains the memory
location of the next instruction of the interrupt subroutine
for interrupt device B. During time H, the INTA
signal is sampled because IUS_SAMPLE is active.
Since INTA is of a higher priority than IUSB, IUSA is
set. Note that both IUSB and IUSA are now set. After
the next write step occurs, step 4 at time H, the second
chosen interrupt device may be serviced.



At time I, step 6 occurs; step 6 is a read of the starting
address of the subroutine A from STRTA 114 to the
internal address bus and the incrementor 4. At time J,
step 7 occurs; step 7 is a write of the incremented value
to the PC_INTA 19, which is the memory location
corresponding to the alternate program counter associated
with the interrupt device A. At time K, step 8
occurs; step 8 is a read from PC_INTA 19 across the
internal address bus and to the incrementor 4.

Skipping to time O, the interrupt subroutine A has
finished, so RST_HIUS is set and IUSA is cleared.
Since IUSB remains set, the interrupt subroutine for
interrupt device B can resume by repeating step 5 without
needing to restore any value to the alternate program
counter, PC_INTB 17. At time P, step 4 repeats.

The alternate program counter corresponding to
interrupt device B (PC_INTB 17) is used as the current
program counter for a few more steps until time U.
When the interrupt subroutine for the interrupt device
B is finished, RST_HIUS is set, IUSB is reset and the
general program counter is used as the current program
counter.
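
The sequence of FIGS. 3A through 4 can be replayed with a small behavioral model. This is a sketch under stated assumptions, not a model of the actual timing: the class `MultiPcCpu` and its methods are hypothetical names, the incrementor is assumed to add one, enable interrupt sampling is assumed to be on whenever `interrupt` is called, and the word numbers 3 and 6 for device A are inferred from the numbering pattern in the description.

```python
PRIORITY = ("A", "B", "C")                  # highest priority first
PC_WORD = {"C": 1, "B": 2, "A": 3}
STRT_WORD = {"C": 4, "B": 5, "A": 6}


class MultiPcCpu:
    def __init__(self, pc, strt_a, strt_b, strt_c):
        # Words 0..6 of memory 2.
        self.mem = [pc, 0, 0, 0, strt_c, strt_b, strt_a]
        self.ius = []            # active flags, highest priority first
        self.entering = set()    # devices whose first fetch is pending

    def interrupt(self, dev):
        """Set the ius flag for dev if it outranks every active flag."""
        rank = PRIORITY.index(dev)
        if all(rank < PRIORITY.index(d) for d in self.ius):
            self.ius.insert(0, dev)
            self.entering.add(dev)
            return True
        return False

    def fetch(self):
        """RD1 then WR1 on the currently selected words."""
        read = write = 0
        if self.ius:
            dev = self.ius[0]
            write = PC_WORD[dev]
            read = STRT_WORD[dev] if dev in self.entering else write
            self.entering.discard(dev)
        address = self.mem[read]
        self.mem[write] = address + 1
        return address

    def rst_hius(self):
        """Clear the highest priority active ius flag."""
        if self.ius:
            self.ius.pop(0)
```

Replaying the example with m-1 = 0x10, two fetches return 0x10 and 0x11; after `interrupt("B")` the fetches return the starting address of routine B and its successor; after `interrupt("A")` the fetches come from routine A; and each `rst_hius()` resumes the interrupted stream at the address already held in its own program counter, with no save or restore step.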

The present invention obviates the steps of saving the 
current pc value and loading the new pc value when 
entering a service subroutine, thereby reducing inter- 
rupt latency. Saved pc value recovery at the end of a 
service routine is eliminated. The scheme is readily 
adaptable to the needs of various cpu architectures, 
including considerations for the instruction sets and 
transaction protocol timing. 

Additionally, the multiple pc scheme can be used for 
regular subroutine calls when non-reentrant program- 
ming is used. Non-reentrant programs are programs 
where a subroutine does not call itself. When non-reen- 
trant programming is used, the present scheme can 
reduce the latency time of switching to a called subrou- 
tine. 

Various details of the implementation and the method 
are merely illustrative of the invention. It will be under- 
stood that various changes in such details may be within 
the scope of the invention, which is to be limited only 
by the appended claims. 

What is claimed is: 

1. A program counter apparatus for reducing the 
latency time of a computer in response to an interrupt, 
said apparatus comprising: 

a memory having memory locations, wherein a gen- 
eral program counter value is located at one of said 
memory locations and alternate program counter 
values are located at a number of other of said 
memory locations; and 

control means connected to said memory for control- 
ling which one of said memory locations contains 
the current program counter value. 

2. The apparatus in claim 1, wherein said control 
means includes a control logic unit and an incrementor, 
said incrementor connected to said memory and said 
control logic unit, wherein said incrementor increments 
the current program counter value, and wherein said 
control logic unit controls the incrementor. 

3. (New) The apparatus of claim 2, wherein said 
memory includes at least one interrupt subroutine start- 
ing address memory location that stores an interrupt 
subroutine starting address and said control means in- 
cludes means for putting the interrupt subroutine start- 
ing address from said interrupt subroutine starting ad- 
dress memory location to the incrementor to be incre- 
mented and then placing the incremented value into one
of said alternate program counter value locations. 

4. The apparatus in claim 1, wherein said control 
means is connected to a number of interrupt devices 
which are designed to send interrupt signals to the control
means and wherein said control means includes
means for using said interrupt signals to select which 
memory location contains the current program counter 
value. 

5. A method for reducing the latency time of a computer
in response to interrupts from interrupt devices,
said computer including a memory with a plurality of 
memory locations, said method comprising the steps of: 

(a) establishing a general program counter value located
at a first of said memory locations in said
memory, establishing alternate program counter
values associated with the interrupt devices, said
alternate program counter values located at other
memory locations in said memory and setting the
location of the current program counter value as
the first memory location containing the general
program counter value;

(b) prioritizing the interrupts received from the interrupt
devices to choose one of the interrupt devices
to be serviced by executing its associated subroutine;

(c) switching the location of the current program
counter value from the first memory location containing
the general program counter value to another
memory location containing the alternate
program counter value associated with the chosen
interrupt device;

(d) thereafter, executing the interrupt subroutine as- 
sociated with the chosen interrupt device; and 

(e) subsequently, switching the location of the current
program counter value back from the another
memory location containing the alternate program
counter value associated with the chosen interrupt
device to the first memory location containing the
general program counter.

6. The method of claim 5, wherein step (c) includes 
the additional step of getting an interrupt subroutine 
starting address from the memory, incrementing this
starting address, and placing the incremented value in
the another memory location to be used as the alternate
program counter value associated with the chosen interrupt
device.

7. The method of claim 5, said method further com- 
prising the following step after step (c) but before step 
(e):

determining whether to service a second interrupt 
device including prioritizing the interrupts re- 
ceived from interrupt devices and comparing them 
to the chosen interrupt device. 

8. The method of claim 7, wherein the additional step
of determining whether to service a second interrupt
device includes the step of not allowing a service to the
second interrupt device unless the interrupt subroutine
associated with the chosen interrupt device enables the
service of the second interrupt device and the second
interrupt device has a greater priority than the chosen
interrupt device.

9. The method of claim 7, further comprising the 
following steps in order after the determining step: 

switching the location of the current program 
counter value from the another memory location 
containing the alternate program counter value 
associated with the chosen interrupt device to yet 
another memory location containing a second al- 
ternate program counter value associated with the 
chosen second interrupt device; 

executing the interrupt subroutine associated with the 
chosen second interrupt device; 

switching the location of the current program 
counter value back from the yet another memory 
location containing the second alternate program 
counter value associated with the chosen second 
interrupt device to the another memory location 
containing the alternate program counter value 
associated with the chosen interrupt device. 

10. A program counter apparatus for reducing the 
latency time of a computer in response to a subroutine 
call, said apparatus comprising: 

a memory having memory locations, wherein a gen- 
eral program counter value is located at one of said 
memory locations and alternate program counter
values are located at a number of other of said 
memory locations; and 

control means connected to said memory for control- 
ling which one of said memory locations contains 
the current program counter value. 

11. The apparatus in claim 10, wherein said control 
means includes means for selecting one of said alternate 
program counter values to be used as the current pro- 
gram counter value in response to a subroutine call from 
an Instruction Decode/Execute unit and means for 
selecting the general program counter value to be used 
as the current program counter value when said In- 
struction Decode/Execute unit finishes executing the 
subroutine. 

12. The apparatus in claim 10, wherein said control 
means includes a control logic unit and an incrementor, 
said incrementor connected to said memory and said 
control logic unit, wherein said incrementor increments 
the current program counter value, and wherein said 
control logic unit controls the incrementor. 

13. The apparatus of claim 12, wherein said memory 
includes at least one subroutine starting address mem- 
ory location that stores a subroutine starting address 
and said control means includes means for putting the 
subroutine starting address from said subroutine starting 
address memory location to the incrementor to be in- 
cremented and then placing the incremented value into 
one of said alternate program counter value locations. 

14. The apparatus of claim 13, wherein said memory 
includes a first port connected to said control means and 
a second port connected to an Instruction Decode/Exe- 
cute unit. 

* * * * *
INVENTOR(S): Stephen H. Chan

It is certified that error appears in the above-identified patent and that said Letters Patent is hereby
corrected as shown below: 

In Column 8, line 62: 

replace "(New) The apparatus of claim 2, wherein said"
with:

"The apparatus of claim 2, wherein said"

In Column 9, line 27:

replace "(c) a switching the location of the current program"
with:

"(c) switching the location of the current program"